#Iris - A Journey through OpenGL and beyond to learn Graphics


wicked notch
#
```
for view in views
  cull(); // uses the previous frame's HZB
  draw();
mark_pages();
make_hzb();
```
delicate rain
#

oh you mean latency in the hzb?

wicked notch
#

yeah

#

actually

#

I'm stupid

#

we have the frame of latency regardless

#

epic

delicate rain
#

It probably won't matter much anyways

wicked notch
#

yeah

#

alright

#

time to write

delicate rain
#

my man is speedrunning this

wicked notch
#

I am

#

it's my life

#

it's now or never

#

I ain't gonna live forever 🎶

wicked notch
#

(it's a reference to Bon Jovi's Livin' on a Prayer song, in case someone doesn't catch that)

frank sail
#

I thought it was a reference to another song but ye

wicked notch
#
```
// Projects an object-space AABB to a UV rectangle (min.xy, max.xy) in [0, 1].
// Assumes clip.w > 0 for every corner (e.g. an orthographic shadow projection).
vec4 project_screen_aabb(in aabb_t aabb, in mat4 transform, in mat4 proj_view) {
    const vec3[] corners = vec3[](
        vec3(aabb.min.x, aabb.min.y, aabb.min.z),
        vec3(aabb.min.x, aabb.min.y, aabb.max.z),
        vec3(aabb.min.x, aabb.max.y, aabb.min.z),
        vec3(aabb.min.x, aabb.max.y, aabb.max.z),
        vec3(aabb.max.x, aabb.min.y, aabb.min.z),
        vec3(aabb.max.x, aabb.min.y, aabb.max.z),
        vec3(aabb.max.x, aabb.max.y, aabb.min.z),
        vec3(aabb.max.x, aabb.max.y, aabb.max.z)
    );
    vec2 min_xy = vec2(1.0);
    vec2 max_xy = vec2(0.0);
    for (uint i = 0; i < 8; ++i) {
        const vec4 clip = proj_view * transform * vec4(corners[i], 1.0);
        const vec2 ndc = clamp(clip.xy / clip.w, -1.0, 1.0);
        const vec2 uv = ndc * vec2(0.5, -0.5) + 0.5;
        min_xy = min(min_xy, uv);
        max_xy = max(max_xy, uv);
    }
    return vec4(min_xy, max_xy);
}

bool is_meshlet_visible(in vec4 box_uvs) {
    const vec2 min_xy = box_uvs.xy;
    const vec2 max_xy = box_uvs.zw;
    // texelFetch needs a sampler, so the size query is textureSize, not imageSize.
    const vec2 hzb_size = vec2(textureSize(u_vsm_hzb, 0));
    // Highest valid mip index (the mip count is one more than this).
    const float max_mip = floor(log2(max(hzb_size.x, hzb_size.y)));
    const float width = (max_xy.x - min_xy.x) * hzb_size.x;
    const float height = (max_xy.y - min_xy.y) * hzb_size.y;
    const float mip = clamp(ceil(log2(max(width, height))), 0.0, max_mip);
    // texelFetch takes integer coordinates in the selected mip's own resolution,
    // and out-of-range fetches are undefined, so scale and clamp them.
    const vec2 mip_size = vec2(textureSize(u_vsm_hzb, int(mip)));
    const ivec2 max_texel = ivec2(mip_size) - 1;
    const bvec4 samples = bvec4(
        bool(texelFetch(u_vsm_hzb, min(ivec2(box_uvs.xy * mip_size), max_texel), int(mip)).x),
        bool(texelFetch(u_vsm_hzb, min(ivec2(box_uvs.zy * mip_size), max_texel), int(mip)).x),
        bool(texelFetch(u_vsm_hzb, min(ivec2(box_uvs.xw * mip_size), max_texel), int(mip)).x),
        bool(texelFetch(u_vsm_hzb, min(ivec2(box_uvs.zw * mip_size), max_texel), int(mip)).x)
    );
    return any(samples);
}
```
this be kinda weird
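The mip selection in `is_meshlet_visible` can be checked on the CPU. The following is a minimal C++ sketch, not the thread's actual code: `max_mip_index`, `select_mip`, and `uv_to_texel` are hypothetical names, and it assumes a square power-of-two pyramid. It shows the two details that are easy to get wrong: the highest mip index is `floor(log2(size))`, and integer fetch coordinates have to be expressed in the chosen mip's resolution.

```cpp
#include <algorithm>
#include <cmath>

struct Texel { int x, y; };

// Highest valid mip index of a texture whose largest side is `size` texels.
int max_mip_index(int size) {
    return static_cast<int>(std::floor(std::log2(static_cast<float>(size))));
}

// Mip at which a box spanning `width` x `height` mip-0 texels is at most one texel wide.
int select_mip(float width, float height, int size) {
    const float extent = std::max(std::max(width, height), 1.0f); // avoids log2(0)
    const int mip = static_cast<int>(std::ceil(std::log2(extent)));
    return std::min(std::max(mip, 0), max_mip_index(size));
}

// Converts a UV in [0, 1] to integer texel coordinates of the given mip,
// clamped so that UV == 1.0 does not index one past the last texel.
Texel uv_to_texel(float u, float v, int size, int mip) {
    const int mip_size = std::max(size >> mip, 1);
    const int x = std::min(static_cast<int>(u * mip_size), mip_size - 1);
    const int y = std::min(static_cast<int>(v * mip_size), mip_size - 1);
    return {x, y};
}
```

For a 128x128 pyramid, a box 16 texels wide lands on mip 4, where the texture is 8x8 and UV (0.5, 0.5) maps to texel (4, 4).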
wicked notch
#

583 device lost and counting

frank sail
#

BORN TO TDR / GPU IS A FUCK / Crash Em All 1989 / I am trash man / 410,757,864,530 LOST DEVICES

wispy spear
#

lol

wicked notch
#

so first culling attempt

#

still takes a fuckton of time to rasterize

#

the regular raster path takes 47 microseconds (without culling lol), VSM raster takes 1 ms (with culling)

#

And the VSM hardware raster shader is sus

#

But I don't see anything extremely slow with it, except the likely divergence lol

#

How loading gl_FragCoord causes a short scoreboard dependency is beyond me

frank sail
#

can you show PTX in the right side

#

SPIR-V is very fake

wicked notch
#

hmm I don't see how to show it

frank sail
#

rip

#

I forgor you can't see real assembly

wicked notch
#

can AMD hardware help here

#

maybe show how AMD compiles this shader

frank sail
#

download RGP

wicked notch
#

there you go AMD man

frank sail
#

I guess gl_Position is stored in four SGPRs which have almost no latency to load

#

so uh I guess that won't cause a stall on AMD

#

idk what NV is doing, but it's probably similarish (they don't have SGPRs, but gl_Position should come from some low-latency memory somewhere)

wicked notch
#

so how fix

frank sail
#

ok actually that s_load_b128 is loading gl_Position into four SGPRs from memory somewhere (s[0:1] is two SGPRs holding an address)

frank sail
#

I'm guessing the source mapping is just misinfo

wicked notch
#

epic

#

caching time then

#

"The best rasterization technique is no rasterization"
~Sun Tzu

wicked notch
#

how is PROP the top sol when it handles early-late Z/depth testing and blending

#

all of which are disabled in the VSM pipeline bleakekw

frank sail
#

the tools are breaking down on us brother

#

we must finish this journey alone

wicked notch
#

I'm so conchfused rn bleakekw

frank sail
#

if it makes you feel any better, my hzb is broken and idk why

glass sphinx
#

Btw since turing nvidia has uniform registers, which are nearly identical to sgprs on amd afaik

#

they are also integer only

frank sail
#

wdym integer only

#

like they store unformatted data?

glass sphinx
#

the sgpr operations can not operate on floating point

frank sail
#

well that's all registers

glass sphinx
#

i guess they can load whatever

frank sail
#

ok

#

so you have to load them into vector registers to do floating point meth

glass sphinx
#

yea thats the same on amd

#

i had to learn that the hard way at work

#

adding one mul made the vgpr use explode by 12

#

🥸 sometimes compiler optimizations bite my ass when they suddenly get turned off by some bingus

frank sail
# glass sphinx yea thats the same on amd

Scalar ALU (SALU) instructions operate on values that are common to all work-items in the wave. These
operations consist of 32-bit integer or float arithmetic, and 32- or 64-bit bit-wise operations. The SALU also can
perform operations directly on the Program Counter, allowing the program to create a call stack in SGPRs.
Many operations also set the Scalar Condition Code bit (SCC) to indicate the result of a comparison, a carry-out,
or whether the instruction result was zero.

glass sphinx
#

i can assure you amd can not do floating point ops with the scalar alu

frank sail
#

the official RDNA 3 ISA guide

#

you may have been looking at RDNA 1 or 2

glass sphinx
#

afaik rdna 3 also doesnt have it yet

#

hmm

frank sail
#

what makes you say that?

glass sphinx
#

i believe i searched for it in the isa before

#

how are they named

#

i cant find any s_XXX_f32 instructions

#

I cant find a single scalar f32 instruction 😦

frank sail
#

I guess my snippet is misinfo because neither can I

glass sphinx
#

Scalar ALU (SALU) instructions operate on values that are common to all work-items in the wave. These
operations consist of 32-bit integer or float arithmetic, and 32- or 64-bit bit-wise operations. The SALU also can
This is really specific on saying float arithmetic works huuuuuuuuh

#

i guess its just very unfortunate wording

frank sail
#

the registers can be used for VALU ops but ye

glass sphinx
#

mhmm

#

but there are rumors that rdna 4 gets float support on the salu

#

that would be sick

wicked notch
#

conchfusion levels are reaching heights I didn't think were possible

#

@frank sail if you disable caching in your thing, what does the GPU trace look like

frank sail
#

99% vsm drawing 1% everything else

#

ok let me give you better info

wicked notch
#

ye can you show the trace

frank sail
#

oops one sec

#

got distracted with manmade horrors in #questions

#

hold up imma delete that pic

#

better crop + relevant pass is selected so the counters are accurate

wicked notch
#

Right yeah this looks like a healthy trace

#

pixel warps very high and top SOL = SM

frank sail
#

hmm

#

I'm drawing every vsm btw

wicked notch
#

yeah same

#

let me get the simple scene

frank sail
#

I gotta schleep but I'll talk l9er

frank sail
#

Which I guess has low geometry complexity but whatever

#

I wonder if low geometry density is the Achilles' heel of this since there will be too much overdraw

#

Not overdraw but rather fs invocations

wicked notch
#

baffling to say the least

#

is it mesh shaders?

#

somehow mesh shaders are garbage for this?

#

I dunno

#

let me try something

frank sail
#

Look at what the active warps are

#

For me it was 99.9% PS warps

wicked notch
#

For me it's 30% PS warps with an occasional spike to 60%

frank sail
#

Btw I hypothesize that reducing the VSM resolution could improve perf as we'll have far fewer fragments for geometry hanging off the edge of the visible bounds

#

As we've learned, only a small fraction of the VSM is in use at a time which makes it a viable change

frank sail
# wicked notch

Did u make sure to select that pass so we're looking at the right counters

#

If so, then those are spooky numbers, Mason

wicked notch
#

uhh one sec

#

there it is

#

spooky numbers indeed

raven orchid
#

The other option would bring back readback

frank sail
#

Yeah or even smaller if possible

#

True bleakekw

raven orchid
#

I’m using 8k personally

wicked notch
#

I'll let my brain machine work

#

you go schlepp jaker, you don't have to suffer with me bleakekw

frank sail
#

Hold on I just had an unhinged idea

#

Are 8192^2 textures a thing that we can make

#

We could make an 8 bit stencil texture and use that for early-stencil

wicked notch
#

that is so unhinged I have no idea how that works 💀

raven orchid
#

Oh dang that brings back week 1 vsm memories

frank sail
#

But the beauty of it is that you just need one total, as long as you're okay with

```
foreach view:
  CullMeshlets(view);
  Draw(view);
```
wicked notch
#

why is a 8192 texture required tho

frank sail
#

Now it's a question of whether the early-s will actually save any perf

wicked notch
#

I severely doubt that

raven orchid
#

We asked froyo that I think

#

Answer was…. Crap

#

Something

frank sail
#

Hmm

frank sail
#

I'm bottlenecked by PS invocations so it'd be nice to skip a bunch of them

#

Not as good as viewport culling, but it's perhaps something

raven orchid
#

Can you resize a viewport using the gpu

raven orchid
#

Crap

frank sail
#

Device generated commands

For each draw in a sequence, the following can be specified:

a different shader group

a number of vertex buffer bindings

a different index buffer, with an optional dynamic offset and index type

a number of different push constants

a flag that encodes the primitive winding
#

Rip formatting

raven orchid
#

Hmmm

#

So this issue

#

Is it that too many meshlets are spilling into non dirty or non resident pages so there is wasted work?

frank sail
#

Setting the front face from the gpu seems useless

frank sail
#

It should only be bad for large meshlets, however

wicked notch
#

We both use 64/64 ultra small meshlets sir

frank sail
#

I mean physically lol

wicked notch
#

Hmmm

frank sail
#

Low geometry density is the problem

wicked notch
#

oh yeah

#

yes

#

the solution is obviously subdividing the geometry

frank sail
#

We need nanite 1 tri/px

wicked notch
#

damn the small indie company Epic Games really did think it through

frank sail
#

Other solutions

  • per triangle culling
  • per triangle clipping against min/max
  • sw rast
raven orchid
#

So we just need to take a small detour and implement full nanite

#

Then our vsms are done

wicked notch
#

well that was in the todo list for me KEKW

#

I "just" need to figure out graph partitioning

frank sail
#

sw rast might actually be viable here

#

I can just lift lvstri's impl frogeheart

raven orchid
#

Also others:

  • readback to modify viewport but 1-3 frames behind
  • only render 1/4 of each clip per frame
#

Wait so what kind of speed up did hzb give?

wicked notch
#

My ultra naive shrimple HZB did a pretty good job at removing wasted mesh shader invocations

#

but well

#

it's still ultra bad

frank sail
#

But preliminary results say: beeg

wicked notch
#

ye it's overall beeg

#

but not enough™️

raven orchid
#

Other option is like

frank sail
#

Btw another solution

  • increase the size of the smallest VSMs so meshlets aren't big compared to them
raven orchid
#

How tiny do we need the base clip map

#

Oh yeah that’s where I was going right now lol

#

My confession is I never wanted vsm quality shadows

#

I just wanted csm but with sparse caching

wicked notch
#

heresy

frank sail
#

Have you heard of parallax corrected cached shadows

raven orchid
#

So I toned down my base clip map lol

#

Sounds familiar

frank sail
#

Tldr solution for the sun rotation with cached shadows

#

You can still have it rotate in huge quantized steps, but you do a ray marching hack to find where the shadow "would be" if the sun was at another angle

raven orchid
#

Does that basically let you reuse old cached data?

frank sail
#

So you can get the appearance of smooth rotation while rendering the shadows very infrequently

raven orchid
#

Oh wow that actually sounds badly needed

frank sail
#

It's only in gpu zen 2

#

But you can Google it and find the online bits

frank sail
#

I think far cry 5 (the game that uses them iirc) only did the parallax thing for their long distance adaptive shadows

#

They did regular csm up to like 30m

raven orchid
#

I still think vsm has the potential to replace csm fully

wicked notch
#

caching + good culling has potential

#

but caching + good culling + 1 tri/px should be best

#

you just need a GPU capable of rasterizing SCREEN_RESOLUTION triangles N times bleakekw

raven orchid
#

Caching + good culling + viewport resize + temporal shadow update budget + parallax corrected + stencil rejection (maybe???)

wicked notch
#

extremely tedious

raven orchid
#

And yeah if you just go full nanite that’ll be the best

wicked notch
#

but possible KEKW

frank sail
#

I'm working with 1 resolution of triangles

wicked notch
#

ye they do LOD per clipmap

#

and then do SSS to fix discrepancies

frank sail
#

Tf

wicked notch
#

yes I'm a SHADOWMAP enjoyer
S help
H orror
A re
D neverending
O this is a legitimate call for help
W
M
A
P

raven orchid
#

Shadow quality is still way better than csm for the same perf

#

So despite the nightmare fuel I think vsm is still worth it

frank sail
#

God it better be

raven orchid
#

4 cascades at 8k with csm killed my gpu

#

But vsm it actually works lol

raven orchid
#

That’s how I’m using lmao

#

Like my shadow bias is now 0

frank sail
#

Tfw no 30 clipmaps at 16k res 😔

raven orchid
#

Consistent across all scenes

#

Actually I can do more than 4 and perf stays the same

#

Probably 10 would be doable

raven orchid
#

Well I still do normal vector offsetting during sampling

frank sail
#

O

raven orchid
#

But slope biasing I removed

frank sail
#

You still need a slope bias hmm

#

Or do you mean polygon offset is gone

raven orchid
#

Yeah that one

frank sail
#

There is a formula to calculate the exact bias needed but I couldn't get it to work so I just added more constants bleakekw

#

I'll get to it later lol

#

Along with proper filtering/10k lines of SMRT

#

Anyways I gotta sleep fr

wispy spear
#

hmm

#

not sure if this is of any interest to you my frog

delicate rain
#

Oh the castle looks awesome

wispy spear
#

yeah

#

thats how the castle thing in UE4 demo should have looked like xD

delicate rain
#

Lmao that would have been insane

wispy spear
#

when you watch it now, its quite underwhelming somehow

delicate rain
#

Really? Those are some standards you have mister

wispy spear
#

it was stunning back then

wicked notch
#

ok so

#

I have been stumped on the great PROP incident for the past week

#

I have zero clue as to what to do besides implement caching

#

but that just sounds like a bandaid to some ultra weird underlying problem

frank sail
#

You can always pause this and begin caching and hzb

#

More like h"z"b since there isn't a z

wicked notch
#

yeah I figure that's all I can do

#

but caching still scares me

#

whatever Saky did was terrifying bleakekw

frank sail
#

Ah

#

I think you can do something shrimpler, like what I did

#

(which still took me weeks to implement because smooth brain)

wicked notch
#

ye perchance

frank sail
#

first step is getting the stable addresses

#

So you can get sharp shadows everywhere, but while keeping the page addresses the same

#

Second step is making sure pages are marked correctly (dirty pages should only be the ones that were just alloc'd this frame, pages that were already visible and continue to be visible should be untouched)

#

Third step is making the pyramid of dirty pages and culling against it like an hzb
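The third step can be sketched on the CPU as a boolean pyramid. This is a hedged C++ illustration under stated assumptions, not either author's implementation: `Level`, `build_page_pyramid`, and `overlaps_dirty_page` are hypothetical names, the page table is square with a power-of-two side, and UVs are already in [0, 1]. Each coarser level stores the OR of a 2x2 block, and a box is tested at the level where it spans at most 2x2 texels, so four corner samples cover it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One level of the pyramid: a square grid of flags, 1 = contains a dirty page.
struct Level {
    int size;
    std::vector<std::uint8_t> dirty;
    bool at(int x, int y) const { return dirty[static_cast<std::size_t>(y) * size + x] != 0; }
};

// Level 0 is the page table itself; each next level ORs 2x2 blocks of the previous one.
std::vector<Level> build_page_pyramid(const std::vector<std::uint8_t>& pages, int page_count) {
    std::vector<Level> levels;
    levels.push_back(Level{page_count, pages});
    while (levels.back().size > 1) {
        const Level& src = levels.back();
        const int half = src.size / 2;
        Level dst{half, std::vector<std::uint8_t>(static_cast<std::size_t>(half) * half, 0)};
        for (int y = 0; y < half; ++y) {
            for (int x = 0; x < half; ++x) {
                dst.dirty[static_cast<std::size_t>(y) * half + x] =
                    src.at(2 * x, 2 * y) || src.at(2 * x + 1, 2 * y) ||
                    src.at(2 * x, 2 * y + 1) || src.at(2 * x + 1, 2 * y + 1);
            }
        }
        levels.push_back(std::move(dst)); // src is not used past this point
    }
    return levels;
}

// Conservative test: true if the UV box might touch a dirty page.
bool overlaps_dirty_page(const std::vector<Level>& levels,
                         float min_u, float min_v, float max_u, float max_v) {
    const int n = levels.front().size;
    const float extent = std::max({(max_u - min_u) * n, (max_v - min_v) * n, 1.0f});
    const int top = static_cast<int>(levels.size()) - 1;
    const int mip = std::min(static_cast<int>(std::ceil(std::log2(extent))), top);
    const Level& level = levels[static_cast<std::size_t>(mip)];
    auto texel = [&](float uv) {
        return std::max(0, std::min(static_cast<int>(uv * level.size), level.size - 1));
    };
    const int x0 = texel(min_u), x1 = texel(max_u), y0 = texel(min_v), y1 = texel(max_v);
    return level.at(x0, y0) || level.at(x1, y0) || level.at(x0, y1) || level.at(x1, y1);
}
```

The test can return true for a box that only touches clean pages (a false positive), but it never rejects a box that overlaps a dirty one, which is the property culling needs.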

wicked notch
#

nice n easy

#

what about the snapping

frank sail
#

It took me like a week but it's literally 10-20 loc

#

That's part of step 1

#

It hurts my fingers to write how it works on mobile

#

(If there is no rotation)

#

Basically this function computes

  1. A view matrix that is snapped to the grid to use for rendering the shadow map
  2. An offset to apply in the fragment shader of the shadow-map pass, which "corrects" the page address. This is needed because the fragment shader only knows about window space (gl_FragCoord)
#

Every other lookup into the VSM uses the stable viewproj, which is placed at the origin, and fract(uv)
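A one-dimensional C++ sketch may make the two outputs easier to picture. It is an illustration under assumptions, not the function described above: all names (`Snap`, `snap_to_page_grid`, `page_from_render`, `page_from_lookup`) are hypothetical, the light does not rotate, and a clipmap covers `page_count` pages of `page_world_size` world units each. The render path sees only window-space pages and adds the snap offset; the lookup path uses a projection fixed at the origin plus `fract`. Both should land on the same wrapped page address.

```cpp
#include <cmath>

struct Snap {
    float snapped_center; // camera position rounded down to a page boundary
    int page_offset;      // added to window-space page indices while rendering
};

Snap snap_to_page_grid(float center, float page_world_size, int page_count) {
    const float page = std::floor(center / page_world_size);
    return {page * page_world_size, static_cast<int>(page) - page_count / 2};
}

// Wraps any integer page index into [0, page_count).
int wrap_page(int page, int page_count) {
    return ((page % page_count) + page_count) % page_count;
}

// Render path: the fragment shader only knows its window-space page.
int page_from_render(float world, const Snap& snap, float page_world_size, int page_count) {
    const float window_min = snap.snapped_center - 0.5f * page_world_size * page_count;
    const int window_page = static_cast<int>(std::floor((world - window_min) / page_world_size));
    return wrap_page(window_page + snap.page_offset, page_count);
}

// Lookup path: stable projection placed at the origin, then fract(uv).
int page_from_lookup(float world, float page_world_size, int page_count) {
    const float uv = world / (page_world_size * page_count);
    const float wrapped = uv - std::floor(uv); // fract(uv)
    return static_cast<int>(wrapped * page_count);
}
```

With 8 pages of 2 units each, a point at 7.5 maps to page 3 whether the camera sits at 5.3 or at 7.9, which is what lets a cached page survive camera movement.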

wicked notch
#

hmmmm

#

the brain's working

frank sail
#

If I could draw a pic

#

But I'm on mobile bleakekw

raven orchid
#

Idk what it stands for

raven orchid
#

Virtual occlusion culling, voc

#

Idk

frank sail
#

page overdraw gulling (pog)

raven orchid
#

I like it

#

“In order to improve performance we introduce hierarchical pog”

frank sail
#

virtual undesirables kulling (vuk)

#

Maybe just hierarchical page culling (hpc)

#

But it isn't culling pages rgbemojiwiggled

#

"Hierarchical page buffer" can't be misinterpreted probably

raven orchid
#

Yeah true I don’t think hpb will be misinterpreted

#

Hopefully

frank sail
#

As long as it has "hierarchical" in the name tbh

wicked notch
#

rendering VSMs for me takes a huge amount of time, the profiler just tells me "idk what it is but the PROP unit is dying"

raven orchid
#

Oh weird

#

Full res 16k?

wicked notch
#

16k viewport yeah

frank sail
#

FYI mine just renders 4k now (but 16k should be okay after the new culling)

wicked notch
#

4k?

#

As in 4k virtual shadow map?

frank sail
#

Yeah

#

Instead of 16k

wicked notch
#

hm

#

what happens if you use 16k

frank sail
#

It doesn't affect correctness at all except at very high resolutions

frank sail
#

Before culling, it worked fine but with worse perf

#

Now I suspect the perf hit is much smaller

wicked notch
#

do you have the thing 🅱️ushed

frank sail
#

Yes, but there is still a bug

#

A 1 liner if you want to pull

#

Oh wait I didn't push that bug

#

Hmm hold on, lemme boot and push (might be a few)

wicked notch
#

smh imagine not pushing bugs

frank sail
#

Well I committed it, just haven't pushed kekwfroggified

frank sail
#

@wicked notch you can pull now

#

it probably works on linux

wicked notch
#

btw

#

but since ndc coords can go beyond [-1;1] this is potentially not the actual min isn't it?

#

perhaps a fract is needed?

frank sail
#

Oh

#

Yeah there is no actual min and max, you understood right

#

They should be infinity and negative infinity to start I guess

frank sail
#

I use a repeat sampler in this case

#

Though fract could be used instead
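The point about starting values can be shown with a small C++ sketch. It is only an illustration: `Vec2`, `Bounds`, `bounds_of`, and `fract` are hypothetical stand-ins for the GLSL types and built-ins, and the input is assumed to be the eight corners already divided by w. Starting from +infinity and -infinity and skipping the clamp keeps the true extent of a box that lies outside [-1, 1]; wrapping then happens only at lookup time, via a repeat sampler or `fract`.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <limits>

struct Vec2 { float x, y; };
struct Bounds { Vec2 min, max; };

// Same role as GLSL fract(): keeps the fractional part, also for negative input.
float fract(float v) { return v - std::floor(v); }

// UV bounds of eight NDC corners, without clamping to the [0, 1] range.
Bounds bounds_of(const std::array<Vec2, 8>& ndc) {
    const float inf = std::numeric_limits<float>::infinity();
    Bounds b{{+inf, +inf}, {-inf, -inf}};
    for (const Vec2& p : ndc) {
        const Vec2 uv{p.x * 0.5f + 0.5f, p.y * -0.5f + 0.5f};
        b.min = {std::min(b.min.x, uv.x), std::min(b.min.y, uv.y)};
        b.max = {std::max(b.max.x, uv.x), std::max(b.max.y, uv.y)};
    }
    return b;
}
```

A box whose NDC x spans 1.2 to 1.6 keeps a width of 0.2 in UV space (1.1 to 1.3), where a clamp to [-1, 1] would collapse it to zero width at the edge.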

delicate rain
#

That is if you want to do the shrimple Jaker view

#

Not sure if I tagged the correct message 😅

wicked notch
#

ahhh yes

#

branch

#

we all know the typical branching operation: storing a value

wicked notch
#

guys I think I'm going insane

#

gl_FragCoord is in window space right?

#

That means if I am rasterizing a 16384^2 viewport, gl_FragCoord should be in range [0;16384) right?

delicate rain
#

Yeah

wicked notch
#

ok

#

Apparently I had a massive bug

#

but the thing just worked because some genius decided that 16384 / 128 = 128

delicate rain
#

Lmao I'm always scared of changing the numbers I have in my defines

#

Because it's 50% chance of breaking every piece of math that was previously aligned

wicked notch
#

ye that's exactly what happened
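The thread does not say what the bug actually was. One way a bug like this can hide, sketched in C++ with hypothetical names, is that a 16384-texel map with 128-texel pages has exactly 128 pages per axis, so dividing a window coordinate by the page count gives the same answer as dividing by the page size. The coincidence disappears as soon as the resolution changes.

```cpp
constexpr int kPageSize = 128; // texels per page side

// Number of pages along one axis of the virtual shadow map.
constexpr int pages_per_axis(int vsm_size) { return vsm_size / kPageSize; }

// Correct: window-space texel coordinate to page index.
constexpr int page_of(int frag_coord) { return frag_coord / kPageSize; }

// Hypothetical mix-up: divides by the page count instead of the page size.
constexpr int page_of_mixed_up(int frag_coord, int vsm_size) {
    return frag_coord / pages_per_axis(vsm_size);
}

// At 16384 both values are 128, so the mix-up is invisible.
static_assert(pages_per_axis(16384) == kPageSize, "page count equals page size at 16k");
static_assert(page_of(16383) == page_of_mixed_up(16383, 16384), "same result at 16k");

// At 8192 there are only 64 pages, and the mixed-up version overshoots.
static_assert(page_of(8191) == 63, "last valid page at 8k");
static_assert(page_of_mixed_up(8191, 8192) == 127, "out of range at 8k");
```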

wicked notch
#

as it turns out

#

rendering twelve 16k shadow maps is hard

#

and as it turns out again

#

halving the resolution unbounded me from the PROP

wicked notch
#

Nsight crashed my computer

#

thank you nvidia

#
```
.capacity = (1 << 24) >> 4, // 2^24 meshlets, packed
```
#

I could've just written 1 << 20

#

but this is funnier

wispy spear
#

: D

wicked notch
#

not bad

#

albeit the culling is slightly off

wispy spear
#

SHRTS should be SHITS

wicked notch
#

I am very much not confident in this:

```
bool project_screen_aabb(in aabb_t aabb, in mat4 transform, in mat4 proj_view, out vec4 box_uvs) {
    const vec3[] corners = vec3[](
        vec3(aabb.min.x, aabb.min.y, aabb.min.z),
        vec3(aabb.min.x, aabb.min.y, aabb.max.z),
        vec3(aabb.min.x, aabb.max.y, aabb.min.z),
        vec3(aabb.min.x, aabb.max.y, aabb.max.z),
        vec3(aabb.max.x, aabb.min.y, aabb.min.z),
        vec3(aabb.max.x, aabb.min.y, aabb.max.z),
        vec3(aabb.max.x, aabb.max.y, aabb.min.z),
        vec3(aabb.max.x, aabb.max.y, aabb.max.z)
    );
    vec2 min_xy = vec2(1.0);
    vec2 max_xy = vec2(0.0);
    for (uint i = 0; i < 8; ++i) {
        const vec4 clip = proj_view * transform * vec4(corners[i], 1.0);
        if (clip.w <= 0.0) {
            return false;
        }
        const vec2 ndc = fract(clip.xy / clip.w);
        const vec2 uv = ndc * vec2(0.5, -0.5) + 0.5;
        min_xy = min(min_xy, uv);
        max_xy = max(max_xy, uv);
    }
    box_uvs = vec4(min_xy, max_xy);
    return true;
}
```
#

specifically the const vec2 ndc = fract(clip.xy / clip.w); part

wispy spear
#

hmm would it help to visualize the values?

wicked notch
#

they're kinda difficult to visualize

#

but perchance

#

actually they may not be difficult at all

wispy spear
#

maybe with a little debug switch in there so that you can toggle that on and off from outside

distant lodge
#

when I was having culling trouble I rendered out debug quads of the depth buffer linearized

#

so I could see what was there

#

(debug quads of the projected sphere)

wicked notch
#

well this is..

#

a mess

frank sail
#

what da views doin

wicked notch
#

ight caching time

wicked notch
#

I be wondering

#

if before caching

#

I should reimplement my nanite

#

just to see if it actually benefits

frank sail
#

we do a little detouring

wicked notch
#

I have all the code ready, I just have to mash it together

#

and make it work (optionally)

delicate rain
#

LVSTRI procrastinating caching I'm procrastinating culling - I like the dynamic

frank sail
#

uh methinks you should do the naniteisms later

wicked notch
#

ye perchance

wicked notch
#

man brain fog is real

#

me: where the fuck is my phone
also me: has phone in my fucking hands with the flashlight on

#

early onset dementia fr

wispy spear
#

lol happened to me recently as well, was under the desk to fiddle cables in the darkness 😄

#

#youarenotalone

raven orchid
#

i have a question

#

how long did it take to implement lod 0 sw rasterizer in compute for virtual geom?

wicked notch
#

A week maybe

#

because I am dumb

#

You can find my endless struggles in this thread bleakekw

raven orchid
#

dang that's pretty fast tho

#

what was or is performance like?

wicked notch
#

I did only visbuf

#

but the gains were there

#

up to 3x as Unreal tested

raven orchid
#

interesting ok

#

guess the stuff in wip has me in a weird place

frank sail
#

It has left me feeling shaken

raven orchid
#

shaken belief in a culling solution?

wicked notch
#

The early Z almost made a comeback

#

rip

frank sail
#

devsh mentioned interpolateAtOffset which is viable

wicked notch
#

ye

#

I gotta render some quads fr

frank sail
#

actually you'd have to write to gl_FragDepth which would kill early z on most hw anyway

#

meh

#

actually

#

yeah just render some quads hehe

wicked notch
#

wait why would you write to gl_FragDepth

frank sail
#

you don't

wicked notch
#

o

frank sail
#

I was trippin fr

wicked notch
#

we good then?

frank sail
#

reasonably

#

just one quad per active page to populate the depth buffer

wicked notch
#

how do I make the quad

#

2 / page_size?

frank sail
#

I guess

#

Actually quads are inefficient

#

Just make a point

wicked notch
#

page sized point?

frank sail
#

uh hold on lemme think

#

I guess a quad actually makes sense if it needs to cover more than one sample

#

Rects would be better but only NV supports those agonyfrog

#

Quads will suffice

wicked notch
#

hmm do we need to cover more than one shrimple

#

I'm approaching this the same way as ROC

wicked notch
#

but hold on

#

how do I get the quad's depth?

#

unless I render the projected meshlet's AABB?

frank sail
#

I was thinking it would be 1 for active pages and 0 for inactive pages

#

so early-z would just cull fragments in inactive pages

wicked notch
#

Hmm that would be reasonable

frank sail
#

when u actually render the geometry, the regular depth will be tested against that

wicked notch
#

when does MSAA come into play though thonk

#

I assume when rendering the actual VSM

frank sail
#

it doesn't, not anymore

wicked notch
#

rip msaa

frank sail
#

devsh mentioned that it's only for when depth has higher samples than color which is the opposite of what he proposed

#

according to the spec

wicked notch
#

ah rip

frank sail
#

but we can use interpolateAtOffset anyways so MSAA always sucked

wicked notch
#

so we declaring gl_FragCoord varying?

frank sail
#

is that real syntax

wicked notch
#

maychance

frank sail
#

in vec4 gl_FragCoord; or something

#

at the very least, we can declare our own window-space vs output

wicked notch
#

ye

#

maybe that's better

frank sail
#

also, don't we need conservative raster for this to work

#

if each fs invocation is to shade multiple pixels

wicked notch
#

you know what

#

Imma draw a fullscreen tringle

#

and check whether I should output a depth of 1 or 0 in the frag shader KEKW

frank sail
#

it's either that or clear+quad per active page

#

I think the latter would be faster in cases where there are few active pages

wicked notch
#

hmm I don't really have a real grasp on how conservative raster works

#

The only time I read about it was for HFTS

#

Nvidia's bullshit technique for shadows KEKW

frank sail
#

it means a fragment is generated if the triangle touches a texel rather than if it covers the center

#

it can also mean a fragment is generated if the triangle covers the entire texel, depending on the mode

#

at least AMD and NV support the ext in vulkan

#

but in GL, only NV supports it kekkedsadge

wicked notch
#

damn rip

#

I don't really see it though (the reason for conservative raster)

frank sail
#

time for triangle dilation geometry shader bleakekw

distant lodge
#

High Fructose Torn Syrup

#

you also use it for voxelizing, iirc devsh mentioned the reason you'd want it in that case over msaa voxelization

#

I just forgot

frank sail
#

so you only need to fill a 128x128 depth buffer to get early z

#

a triangle could cover a significant amount of a page without actually covering the center sample, but obviously you care about all the texels in the page being written

#

but if the triangle doesn't cover the center sample, nothing gets written unless you use conservative raster

#

realistically, you'd only do like 4x4 or 8x8 pixels in the fs so the error wouldn't be so large, but it would still be less than ideal
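The difference between the two coverage rules can be sketched in C++. This is a simplified illustration with hypothetical names (`covers_point`, `touches_texel`): it ignores the fill rule for samples exactly on an edge, and it uses a separating-axis test as a stand-in for what overestimating conservative rasterization does in hardware.

```cpp
#include <algorithm>

struct Vec2 { float x, y; };
struct Triangle { Vec2 a, b, c; };

// Positive when r lies to the left of the directed edge p -> q.
float edge(Vec2 p, Vec2 q, Vec2 r) {
    return (q.x - p.x) * (r.y - p.y) - (q.y - p.y) * (r.x - p.x);
}

// Standard rule (simplified): a fragment exists only if the sample point is inside.
bool covers_point(const Triangle& t, Vec2 p) {
    const float e0 = edge(t.a, t.b, p), e1 = edge(t.b, t.c, p), e2 = edge(t.c, t.a, p);
    const bool has_neg = e0 < 0 || e1 < 0 || e2 < 0;
    const bool has_pos = e0 > 0 || e1 > 0 || e2 > 0;
    return !(has_neg && has_pos);
}

// Conservative (overestimating) rule: true if the triangle touches the square [lo, hi].
bool touches_texel(const Triangle& t, Vec2 lo, Vec2 hi) {
    const Vec2 tri[3] = {t.a, t.b, t.c};
    const Vec2 box[4] = {{lo.x, lo.y}, {hi.x, lo.y}, {lo.x, hi.y}, {hi.x, hi.y}};
    Vec2 axes[5] = {{1.0f, 0.0f}, {0.0f, 1.0f}, {}, {}, {}};
    for (int i = 0; i < 3; ++i) {
        const Vec2 p = tri[i], q = tri[(i + 1) % 3];
        axes[2 + i] = {-(q.y - p.y), q.x - p.x}; // edge normal
    }
    for (const Vec2& axis : axes) {
        float tri_min = 1e30f, tri_max = -1e30f, box_min = 1e30f, box_max = -1e30f;
        for (const Vec2& v : tri) {
            const float d = v.x * axis.x + v.y * axis.y;
            tri_min = std::min(tri_min, d);
            tri_max = std::max(tri_max, d);
        }
        for (const Vec2& v : box) {
            const float d = v.x * axis.x + v.y * axis.y;
            box_min = std::min(box_min, d);
            box_max = std::max(box_max, d);
        }
        if (tri_max < box_min || box_max < tri_min) return false; // separating axis found
    }
    return true;
}
```

A small triangle tucked into the corner of a texel fails `covers_point` at the texel center but passes `touches_texel`, which is the case where a coarse per-page mask would silently drop writes without conservative raster.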

wicked notch
#

wait hold on

distant lodge
#

why do you wanna manually dilate though, with NV at least the conservative rasterization ext is available since maxwell afaik

wicked notch
#

I thought we were using a depth buffer as big as the VSM but 16 bit?

distant lodge
#

(which still blows me out but what are you gonna do)

wicked notch
#

I don't think anything older than maxwell should be legally allowed to run VSM

frank sail
#

16-bit depth is ass though

wicked notch
#

we only store a 1 or a 0 it's fine

frank sail
#

well I guess if it's just being used to store a binary value then ye

#

lol

distant lodge
#

why not use a stencil buffer

frank sail
#

almost forgot that we actually have two depth buffers bleakekw

distant lodge
#

can't those have 1 bit fidelity

#

or storage I should say

frank sail
#

no but they can have 8 bit storage

#

I'm kinda lukewarm on the whole idea tbh

wicked notch
#

is there early stencil?

frank sail
#

yes

wicked notch
#

pog

#

I've never rendered to a stencil buffer ever in my life

frank sail
#

same

wicked notch
#

I don't even know how you actually render into one bleakekw

distant lodge
#

I think you set the read/write ops in a similar way to configuring depth ops

#

they just work differently

wispy spear
#

glStencilMask similar to glDepthMaskisms?

#

and glColorMask(false, false, false, false)

frank sail
#

probably

wicked notch
#
```
typedef struct VkStencilOpState {
    VkStencilOp    failOp;
    VkStencilOp    passOp;
    VkStencilOp    depthFailOp;
    VkCompareOp    compareOp;
    uint32_t       compareMask;
    uint32_t       writeMask;
    uint32_t       reference;
} VkStencilOpState;
```
wtf is this
frank sail
#

btw this all only helps the case where every page isn't being drawn to, i.e., when culling is also working

wicked notch
#

and it's definitely more conservative than necessary

frank sail
#

there is a learnopengl tutorial for stencil KEKW

wicked notch
#

back to our origins huh

frank sail
#

I'm more tempted to try gl_{Clip, Cull}Distance before this tbh

wispy spear
#

when im back from memory transfers it better works

frank sail
#

filling a big chungus stencil buffer for every draw sounds cap

#

also you literally can't even make the stencil buffer for some of the resolutions we target

wicked notch
#

if only we could do MSAA

wispy spear
#

can you use a ubo/ssbo in a smol compute pass in between instead?

frank sail
#

unless you do the fake MSAA thing with interpolateAtOffset (which requires conservative raster)

wispy spear
#

you want to write something to the stencil buffer

#

can this be replaced with a ubo/ssbo instead

wicked notch
#

no because we lose early stencil then

#

rip

wispy spear
#

ah thats also a thing

frank sail
#

yeah the goal is to not emit fragment shader invocations somehow

frank sail
#

well maybe there's a way to change the planes in that case

#

basically there will either be a large void in the middle that you don't want to render to, or a small square in the middle that you do want to render to

#

for posterity

#

idk how to efficiently compute the former bounds tbh. you can't do a simple min/max on the page coordinates

wicked notch
#

you could just coerce AMD to make sparse not garbage on windows

#

and all our problems would vanish

#

use deadly force if necessary bleakekw

frank sail
#

guess I'll try culling individual triangles against the page hierarchy

#

then sw rasterize nervous

wicked notch
#

btw I fixed the HZB

#
```
vec4 project_screen_aabb(in aabb_t aabb, in mat4 transform, in mat4 proj_view) {
    const vec3[] corners = vec3[](
        vec3(aabb.min.x, aabb.min.y, aabb.min.z),
        vec3(aabb.min.x, aabb.min.y, aabb.max.z),
        vec3(aabb.min.x, aabb.max.y, aabb.min.z),
        vec3(aabb.min.x, aabb.max.y, aabb.max.z),
        vec3(aabb.max.x, aabb.min.y, aabb.min.z),
        vec3(aabb.max.x, aabb.min.y, aabb.max.z),
        vec3(aabb.max.x, aabb.max.y, aabb.min.z),
        vec3(aabb.max.x, aabb.max.y, aabb.max.z)
    );
    vec2 min_uv = vec2(+3.402823466e+38);
    vec2[8] semi_uv;
    for (uint i = 0; i < 8; ++i) {
        const vec4 clip = proj_view * transform * vec4(corners[i], 1.0);
        const vec2 uv = (clip.xy / clip.w) * vec2(0.5, -0.5);
        min_uv = min(min_uv, uv);
        semi_uv[i] = uv;
    }
    vec2 min_xy = vec2(0.0);
    vec2 max_xy = vec2(1.0);
    for (uint i = 0; i < 8; ++i) {
        const vec2 uv = fract(semi_uv[i] + min_uv);
        min_xy = min(min_xy, uv);
        max_xy = max(max_xy, uv);
    }
    return vec4(min_xy, max_xy);
}
```
#

for posterity too

frank sail
#

is this for your boolean hzb?

wicked notch
#

ye

#

also, how did you fix biasing?

#

I am still only applying a mere constant offset bleakekw

frank sail
#

did you see the desmos I spammed

wicked notch
#

so the old usual tan(acos())

#

epic

frank sail
#

there's a linalg version below

#

I impl'd this but it's fucked somehow and I didn't care enough to fix it

#

btw you also need to account for fp error (somehow)

#

I think I just added a constant 1.0 / (1 << 24) or something
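
For reference, the "tan(acos())" slope bias plus the constant fp term could look roughly like this on the CPU side. This is a minimal sketch, not the code from either renderer: `slope_scaled_bias`, the `texel_size` parameter and the 0.05 clamp are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: depth bias for a shadow texel of world-space size
// `texel_size`, given the cosine between the surface normal and the light.
inline float slope_scaled_bias(float n_dot_l, float texel_size) {
    assert(texel_size >= 0.0f);
    // Clamp away from 0 so the slope stays finite at grazing angles (assumed limit).
    const float c = std::clamp(n_dot_l, 0.05f, 1.0f);
    // tan(acos(c)) == sqrt(1 - c^2) / c, without the trig calls.
    const float slope = std::sqrt(1.0f - c * c) / c;
    // Constant term mentioned in the chat: one step of a 24-bit depth value.
    const float fp_epsilon = 1.0f / static_cast<float>(1 << 24);
    return texel_size * slope + fp_epsilon;
}
```

With the light head-on (`n_dot_l == 1`) only the constant term remains; the bias grows as the surface turns away from the light.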

cold sky
# distant lodge I just forgot

it's because MSAA is a grid, just like using a higher resolution render target, so when you voxelize you can still miss a sample and have seams/gaps in your voxelization

wicked notch
#

need your data @frank sail for posterity

wicked notch
#

sir @wispy spear can I share the sparse experiment in #experiments

#

There's a possibility that your system will lock up if you are on NV/Windows so maybe a warning is necessary

wispy spear
frank sail
#

That's for their ray tracing extensions

#

Opacity and displacement micromap generation and stuff

wispy spear
#

ah so something completely different

#

i saw meshopt showing up in there a few weeks or months ago and thought there were some meshopt-isms going on

frank sail
#

Yeah though I get how it invokes the idea of meshlets when it says "micromeshes"

wispy spear
#

ye meshleading

wicked notch
#

I have faith

#

With the new batching & waiting method vkQueueBindSparse does NOT cause hitches

#

when updating lots of clipmaps

#

vkQueueBindSparse's perf still sucks though

#

but it's not as bad

#

I wonder

#

Maybe setting a limit of updated pages per frame won't cause a lot of pop-in if I can keep framerate up?
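
A per-frame limit like that could be sketched as a queue that hands out at most `budget` pages per frame and keeps the rest for later. Everything here (`pending_binds_t`, `take`) is a hypothetical name, and the actual vkQueueBindSparse submission is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical queue of virtual page indices waiting for a sparse bind.
struct pending_binds_t {
    std::deque<std::uint32_t> pages;

    // Pops at most `budget` pages for this frame; the rest stay queued.
    std::vector<std::uint32_t> take(std::size_t budget) {
        assert(budget > 0);
        std::vector<std::uint32_t> batch;
        while (!pages.empty() && batch.size() < budget) {
            batch.push_back(pages.front());
            pages.pop_front();
        }
        return batch;
    }
};
```

Pages that miss the budget simply show up one or more frames later, which is where the pop-in trade-off comes from.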

#

Nsight's sparse image viewer is pretty pog though ngl

wicked notch
#

vkQueueBindSparse + vkQueueWaitIdle KEKW

#

It is entirely possible that my previous setup with the timeline semaphore was wrong

#

I also used the graphics queue for everything, maybe using the transfer or the compute queue + timeline semaphore could give me more perf

wicked notch
#

I just gave up very quickly, this time I'll put in more work to make vkQueueBindSparse not suck and report back

#

Thanks to sharpneli & co. I now have a lot more info

cold sky
wicked notch
#

Ye, previously I was worried about the CPU overhead, but as you saw in the experiments room, if you don't spam vkQueueBindSparse calls (or you just wait for idle) the CPU overhead is basically zero

cold sky
wicked notch
#

Yes but even then, if you call vkQueueBindSparse once per frame, if enough frames pass you have a big issue

#

because the HW queue fills up

cold sky
#

use temporal smoothing / fixed mem budget

wicked notch
#

yeah I have some ideas about that

cold sky
#

one would be to wait for a page to be unused for K frames before evicting

wicked notch
#

I'll first try deferred page unbinding, if there's enough memory in the pool

cold sky
#

btw this strategy is nice because for every bind, you have a matching unbind

#

basically to page something in, you have to page something out

wicked notch
#

Hmm yeah that does sound good

#

also lets me never actually "unbind" (as in, submit a VkSparseImageMemoryBind with a null memory handle)

#

You can just replace the page with a new offset, saving one bind operation
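
Put together, "wait K frames before evicting" and "replace instead of unbind" might look like the pool below. It is only a sketch under assumptions: `page_pool_t`, `slot_t` and `grace_frames` are made-up names, and a returned slot index stands in for the memory offset a real bind would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct slot_t {
    std::int64_t page = -1;      // owning virtual page, -1 if the slot is free
    std::uint64_t last_used = 0; // last frame the page was requested
};

struct page_pool_t {
    std::vector<slot_t> slots;
    std::uint64_t grace_frames; // the "K" from the chat

    page_pool_t(std::size_t count, std::uint64_t k) : slots(count), grace_frames(k) {
        assert(count > 0);
    }

    // Marks a resident page as requested this frame; returns false if not resident.
    bool touch(std::int64_t page, std::uint64_t frame) {
        for (auto& slot : slots) {
            if (slot.page == page) {
                slot.last_used = frame;
                return true;
            }
        }
        return false;
    }

    // Picks a slot to (re)bind for `page`: a free slot if there is one, else the
    // least recently used slot that has been idle for at least K frames.
    std::optional<std::size_t> acquire(std::int64_t page, std::uint64_t frame) {
        std::optional<std::size_t> victim;
        for (std::size_t i = 0; i < slots.size(); ++i) {
            const bool is_free = slots[i].page < 0;
            const bool is_stale = !is_free && frame - slots[i].last_used >= grace_frames;
            if (!is_free && !is_stale) continue;
            if (!victim) {
                victim = i;
            } else if (slots[*victim].page >= 0 &&
                       (is_free || slots[i].last_used < slots[*victim].last_used)) {
                victim = i;
            }
        }
        if (victim) slots[*victim] = {page, frame};
        return victim;
    }
};
```

Reusing a stale slot means one bind replaces the old mapping, so no separate unbind is submitted for the evicted page.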

cold sky
#

btw you can score all pages (cause there are so few of them) on the GPU and run a GPU compute sort

#

(or CPU radix)

#

and if your mem budget is N pages, then you grab N most important ones

wicked notch
#

what do I do with this score, optimal linear allocation?

#

Ah importance

cold sky
#

figure out which ones need to be paged in the most

#

btw even if a page is not needed, you shouldn't set its importance to 0

#

you can do something like 0-1 is how likely it is to come into view, and >1 is for visible pages

cold sky
wicked notch
#

nice that's a great heuristic

#

figuring out the most important pages is gonna be a chore though

#

perhaps I can score them using the 1 / clipmap_level and 1 / distance_to_camera heuristics
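
A CPU-side version of that scoring could be sketched as follows: non-visible pages land in (0, 1], visible pages above 1, and the N best are kept. The exact formula (`1 / ((1 + level) * (1 + distance))`) and the names `score_page` and `most_important` are assumptions; the chat only names the two heuristics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct page_score_t {
    std::uint32_t page;
    float score;
};

// Assumed heuristic: closer pages and finer clipmaps matter more.
// Visible pages always score above 1, everything else stays in (0, 1].
inline float score_page(bool visible, std::uint32_t clipmap_level, float distance_to_camera) {
    assert(distance_to_camera >= 0.0f);
    const float likelihood =
        1.0f / ((1.0f + static_cast<float>(clipmap_level)) * (1.0f + distance_to_camera));
    return visible ? 1.0f + likelihood : likelihood;
}

// Keeps the `budget` highest scoring pages, best first.
inline std::vector<page_score_t> most_important(std::vector<page_score_t> pages, std::size_t budget) {
    const std::size_t count = std::min(budget, pages.size());
    std::partial_sort(
        pages.begin(), pages.begin() + static_cast<std::ptrdiff_t>(count), pages.end(),
        [](const page_score_t& a, const page_score_t& b) { return a.score > b.score; });
    pages.resize(count);
    return pages;
}
```

`std::partial_sort` is enough here because only the top N matter; a GPU sort or CPU radix sort, as suggested above, would replace it at scale.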

wicked notch
#

alright I got hardware VSM back in full functionality

#

now with 100% more popin

#

I have reclaimed earlyZ and hardware filtering though

#

Look at ZROP go though KEKW

#

ZROP being abused in unspeakable ways is always fun

delicate rain
#

This is with or without culling?

wicked notch
#

Without

delicate rain
#

16 clips?

wicked notch
#

16 clips ye

delicate rain
#

Wtf that's pretty good

wicked notch
#

It is

#

It's at least twice as fast as software VSM

#

if not more

#

depending on position etc

#

The horrible thing is the frame of latency

#

That is just so bad

delicate rain
#

I didn't really investigate real sparse. What is the process you do now / how does it work?

#

(if you don't mind going over it ofc 🥹)

wicked notch
#

Sure thing

#

So first things first, the marking of visible pages remains the same

#

Now though, instead of sending it to another compute shader to allocate pages, we read it back on the CPU

#

Specifically I do this

#
auto bindings = std::vector<ir::sparse_image_memory_bind_t>();
bindings.reserve(requests.size());
for (auto page = 0_u64; page < requests.size(); ++page) {
    const auto offset = ir::offset_3d_t {
        .x = static_cast<int32>((page % IRIS_VSM_VIRTUAL_PAGE_ROW_SIZE) * IRIS_VSM_VIRTUAL_PAGE_SIZE),
        .y = static_cast<int32>((page / IRIS_VSM_VIRTUAL_PAGE_ROW_SIZE) * IRIS_VSM_VIRTUAL_PAGE_SIZE),
    };
    const auto extent = ir::extent_3d_t {
        .width = IRIS_VSM_VIRTUAL_PAGE_SIZE,
        .height = IRIS_VSM_VIRTUAL_PAGE_SIZE,
        .depth = 1,
    };
    if (is_requested && !is_allocated) {
        const auto entry = _allocator.get().allocate();
        bindings.emplace_back(ir::sparse_image_memory_bind_t {
            .offset = offset,
            .extent = extent,
            .buffer = _buffer->slice(memory_offset, IRIS_VSM_PHYSICAL_PAGE_RESOLUTION, false),
        });
        _pages[page] = entry;
        _allocated++;
    } else if (!is_requested && is_allocated) {
        bindings.emplace_back(ir::sparse_image_memory_bind_t {
            .offset = offset,
            .extent = extent,
            .buffer = _buffer->slice(memory_offset, IRIS_VSM_PHYSICAL_PAGE_RESOLUTION, true),
        });
        _allocator.get().deallocate(entry);
    }
}
return bindings;
#

This takes in requests which is the read-back array

#

and outputs an array of sparse_image_memory_bind_t

#

Which is equivalent to VkSparseImageMemoryBind

#

It basically takes in the offset and extent of a region of our virtual image and assigns it to a certain offset of a VkDeviceMemory
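
Since the snippet above leaves `is_requested`, `is_allocated` and `memory_offset` implicit, here is a standalone sketch of the same request-to-bind step. It uses plain structs instead of the Iris or Vulkan types, and the constants, `bind_t` and `page_table_t` are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All three constants are assumed values for illustration.
constexpr std::uint32_t page_row_size = 4;  // virtual pages per row
constexpr std::uint32_t page_size = 128;    // texels per page side
constexpr std::int64_t page_bytes = 65536;  // bytes per physical page

struct bind_t {
    std::int32_t x, y;           // texel offset inside the virtual image
    std::uint32_t width, height; // always one page here
    std::int64_t memory_offset;  // byte offset into the pool, -1 means unbind
};

struct page_table_t {
    std::vector<std::int64_t> slot_of_page; // -1 when the page is not resident
    std::vector<std::int64_t> free_slots;   // physical slots available for binding

    // `requests` is the read-back array: non-zero means the page was marked visible.
    std::vector<bind_t> update(const std::vector<std::uint8_t>& requests) {
        assert(requests.size() == slot_of_page.size());
        std::vector<bind_t> binds;
        for (std::size_t page = 0; page < requests.size(); ++page) {
            const bool is_requested = requests[page] != 0;
            const bool is_allocated = slot_of_page[page] >= 0;
            bind_t bind{
                static_cast<std::int32_t>((page % page_row_size) * page_size),
                static_cast<std::int32_t>((page / page_row_size) * page_size),
                page_size, page_size, -1,
            };
            if (is_requested && !is_allocated && !free_slots.empty()) {
                const std::int64_t slot = free_slots.back();
                free_slots.pop_back();
                slot_of_page[page] = slot;
                bind.memory_offset = slot * page_bytes;
                binds.push_back(bind);
            } else if (!is_requested && is_allocated) {
                free_slots.push_back(slot_of_page[page]);
                slot_of_page[page] = -1;
                binds.push_back(bind); // memory_offset stays -1: unbind
            }
        }
        return binds;
    }
};
```

Pages whose state did not change produce no bind at all, which keeps the list handed to the queue as short as possible.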

#

The next step is to send this info over to vkQueueBindSparse which is just this

for (auto i = 0_u32; i < IRIS_VSM_CLIPMAP_COUNT; ++i) {
    const auto requests = vsm_visible_pages_buffer.as_span();
    // std::vector<ir::sparse_image_memory_bind_t>
    const auto bindings = _vsm.images[i].make_updated_sparse_bindings(requests.subspan(
        i * IRIS_VSM_VIRTUAL_PAGE_COUNT,
        IRIS_VSM_VIRTUAL_PAGE_COUNT
    ));
    if (!bindings.empty()) {
        _sparse_bind_info.image_binds.emplace_back(ir::sparse_image_memory_bind_info_t {
            .image = std::cref(_vsm.images[i].image()),
            .bindings = std::move(bindings),
        });
    }
}
#

And then it's just a good ol' vk call

#
sparse_timeline_value = _sparse_bind_semaphore->increment(2);
_sparse_bind_info.wait_semaphores = {
    { std::cref(*_sparse_bind_semaphore), {}, sparse_timeline_value }
};
_sparse_bind_info.signal_semaphores = {
    { std::cref(*_sparse_bind_semaphore), {}, sparse_timeline_value + 1 }
};
_device->compute_queue().bind_sparse(_sparse_bind_info);
#

I use timeline semaphores such that:

  • if there were no previous sparse binds, simply signal to the graphics queue once it is done
  • If there was a previous sparse bind, wait for it to be done and signal the graphics queue
#

Finally, sampling is super easy

#
float sun_shadow = 0.0;
const int vsm_residency = sparseTextureLodARB(u_vsm[virtual_page.position.z], virtual_page.uv.xyz, 0, sun_shadow);
if (!sparseTexelsResidentARB(vsm_residency)) {
    sun_shadow = 1.0;
}
delicate rain
#

Hmmm so the latency is only due to the readback?

wicked notch
#

Yep

#

I can't even do anything about it because I need to update the sparse bindings through vkQueueBindSparse

delicate rain
#

Right okay, correct me if I'm wrong, but the popping comes because you try to read sparse memory at a clip level where nothing is located, since the requesting logic is the same as the sampling logic. Aka it tries to read clip 3 while the memory is still bound to clip 4?

#

Due to the single frame lag

wicked notch
#

It tries to read clip 3 but the memory is unbound

#

Right now I immediately unbind all pages that are not requested in the current frame

#

so if in the previous frame some page wasn't allocated, but the current frame says "now it is allocated boi", I sample an unbound page

#

The only way to fix it is to have no frames in flight (and stall before sampling) KEKW

delicate rain
#

Hmmmm

#

I know how to fix the popping due to clip level switch

#

You just delay the clip level you sample by a single frame

#

But disocclusion and edge of frame will still pop

#

Okay I have a cursed idea

#

What if we tried to combine fake sparse and hw sparse

wicked notch
#

how so

delicate rain
#

We use hw sparse for pages that are only switching lod levels - since you can just delay the sampling as I described. But for pages that cover pixels that were previously not visible at all we do fake sparse with no delay

#

Then once these pages will switch lod we move them to the hw sparse path

wicked notch
delicate rain
#

In your vsm page table you'd need to store info on whether a page is fake or hw sparse to know what to sample

#

Ah but this would double the amount of clipmaps we have to draw

wicked notch
#

Double the clipmaps but with efficient culling that's not a problem smart

delicate rain
#

The culling is also the same for non-resident pages, so we could do two-step culling: first a shared cull for both real and fake, then a second cull for the pages of real and fake separately

#

But I feel like this is extremely cursed

wicked notch
#

Let me try getting rid of the timeline semaphore and just stalling the device

#

because that's what good people do, they stall the device

#

Ok I got a bsod

#

very good

delicate rain
#

We just need bindSparseIndirect smh

#

Is that too much to ask for?

wicked notch
#

goddamnit hw filtering looks so good

#

fuck me

wispy spear
#

but the textures look like shit, some washed out BDU pants

wicked notch
#

This debug view is so good

#

In white are the active pages relative to what the shadow map sees

raven orchid
#

do you have hpb running as well?

#

like sparse depth + hpb cull

wicked notch
#

working on it rn

#

I already have the code I just have to make it work™️

raven orchid
#

i'm super interested in the final perf

#

adding hpb really helped mine

#

would you say

#

your current sparse approach is getting close to nvidia's GL driver impl of sparse?

wicked notch
#

Nah

#

Mine is very rudimentary right now

#

It's just "update what changed and semaphore"

#

No deferred or staggered updates or stuff like that

raven orchid
#

hmm ok makes sense

#

i wonder if this means sparse will be viable after all

wicked notch
#

I hate readback with a passion though bleakekw

raven orchid
#

save us from the readback

#

gpu driven sparse management

wicked notch
#

while we're at it, why not make it also perform well out of the box

cold sky
#

ah rasterization, the paradigm where you heroically toil to overcome problems unknown in raytracing

raven orchid
#

like from your experiment it sounds like it might be possible, but also seems very tricky to get the API to not obliterate performance

wicked notch
#

It's tricky but completely doable

#

Right now I have the most naive implementation possible, and vkQueueBindSparse takes at most 10ms

#

When updating across all 16 clipmaps

#

Also, viewing clip 2 in nsight makes me have a device loss error?

raven orchid
#

Dang that’s surprisingly not bad

#

Wonder how low you can get it

wicked notch
delicate rain
#

The fact that you have to have no fif makes it really bad though

#

I'm not sure there is a way to overcome that

wicked notch
#

I still have 2 FIF

#

Thing is, you don't just need 1 frame in flight

delicate rain
#

yeh but your shadows are scuffed

wicked notch
#

you need 1 frame in flight and issue a complete queue stall

#

you need to do both to not have scuffed shadows

delicate rain
#

graphics queue stall?

wicked notch
#

ye

glass sphinx
#

is that a waitidle?

#

on queue?

wicked notch
#
while (is_running()) {
    mark_pages();
    readback();
    wait_idle();
    update_bindings();
    render_shadows();
}
glass sphinx
#

so no frame in flight?

wicked notch
#

no frame in flight

delicate rain
#

why do you need the wait idle?

#

is that just how sparse works?

wicked notch
#

nono, you can omit the wait idle, you just have to work with an older readback then

#

the wait idle is there to complete the transfer from device to host

glass sphinx
#

thats fine tho

wicked notch
#

waiting for transfer?

glass sphinx
#

delayed readback

wicked notch
#

ah yeah
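
Delayed readback without a `wait_idle()` could be sketched as a ring with one host-visible buffer per frame in flight: each frame the CPU reads the slot it is about to reuse, which holds the requests from `frames_in_flight` frames ago. `readback_ring_t` is a hypothetical name, and `record` stands in for the GPU copy plus the per-frame fence wait that a real renderer would do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct readback_ring_t {
    std::vector<std::vector<std::uint8_t>> slots; // one buffer per frame in flight

    explicit readback_ring_t(std::size_t frames_in_flight) : slots(frames_in_flight) {
        assert(frames_in_flight > 0);
    }

    // Call after the fence for this frame's slot has signaled: the slot still
    // holds the page requests written `slots.size()` frames ago.
    const std::vector<std::uint8_t>& oldest(std::uint64_t frame) const {
        return slots[frame % slots.size()];
    }

    // Stand-in for the device-to-host copy recorded during `frame`.
    void record(std::uint64_t frame, std::vector<std::uint8_t> requests) {
        slots[frame % slots.size()] = std::move(requests);
    }
};
```

The bindings then lag the marking pass by the number of frames in flight, which is the latency discussed above, in exchange for never stalling the queue.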

wicked notch
delicate rain
#

We delay the user input for FIF and then the readback will be fine

wispy spear
#

just show a loading screen

#

or some loading.gif on a quad

#

starfield did it

#

cod did it

#

why not showcase_vsm.exe

delicate rain
#

we have cool shadows but you cannot move or rotate camera or else loading screen

wispy spear
#

: D

#

ok im sorry, just here to keep the morale up

wicked notch
#

I'll stick with HWVSM for a while more

#

after all SWVSM is a git checkout away

wicked notch
#

I am so tempted to actually put a stall and see what happens bleakekw

#

father forgive me

#

don't mind the giant hole though

#

Totally not issuing a full stall 💀

wicked notch
#

so good

cold sky
#

Git good

#

At fortune telling

#

Or loddin

frank sail
#

The pop in would probably be less bad if you cached pages that weren't just on screen

wicked notch
#

ye

#

once again caching would save my ass

frank sail
#

there'd still be popping when you see new pages so meh

wicked notch
#

I already have a full stall in my code rn so bleakekw

#

by writing wait_idle() I have forsaken my humanity Jaker

cold sky
#

the annoying thing to compensate for are disocclusions

#

run a small blur/average over your pages to score them? That way a page that isn't needed right now still feels important because it has resident neighbours
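
The blur idea could be sketched as a 3x3 box average over a per-page score grid, so a page next to resident or visible pages inherits some of their importance. The function name and the choice of a box filter that only averages in-bounds neighbours are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages each page score with its in-bounds 3x3 neighbourhood.
inline std::vector<float> blur_scores(const std::vector<float>& scores,
                                      std::size_t width, std::size_t height) {
    assert(scores.size() == width * height);
    std::vector<float> blurred(scores.size(), 0.0f);
    const auto w = static_cast<std::ptrdiff_t>(width);
    const auto h = static_cast<std::ptrdiff_t>(height);
    for (std::ptrdiff_t y = 0; y < h; ++y) {
        for (std::ptrdiff_t x = 0; x < w; ++x) {
            float sum = 0.0f;
            int count = 0;
            for (std::ptrdiff_t dy = -1; dy <= 1; ++dy) {
                for (std::ptrdiff_t dx = -1; dx <= 1; ++dx) {
                    const std::ptrdiff_t nx = x + dx;
                    const std::ptrdiff_t ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    sum += scores[static_cast<std::size_t>(ny * w + nx)];
                    ++count;
                }
            }
            blurred[static_cast<std::size_t>(y * w + x)] = sum / static_cast<float>(count);
        }
    }
    return blurred;
}
```

Feeding the blurred grid into the eviction score would keep pages around a visible region alive a little longer, which targets the disocclusion case.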

frank sail
#

Or just use a GPU allocator for zero frames of disocclusion lag smart

cold sky
#

and HW rasterizer neatly dropping writes to non-resident pages

frank sail
#

yeah but you actually rarely have to draw vsm pages so it's not a big deal

wicked notch
#

moving the camera causes a lot of cache invalidations though and perf dips quite a bit

frank sail
#

that's why you don't move the camera

cold sky
wicked notch
#

if the sun moves it's joever

cold sky
#

also you know, if dynamic occluders within the pages move too XD

frank sail
#

Nuh uh

raven orchid
#

I think unreal is trying to move to a dual vsm mem pool solution to handle dynamics

#

Then they just merge dynamic with static cache

frank sail
#

Ye

cold sky
raven orchid
#

Yeah idk what their plan is

#

Or maybe they’re not using real sparse api? Idk

cold sky
#

from what I can see your biggest bottleneck is the 1 frame lag + how many pages you can bind per frame

#

not the actual drawing, which is hilarious

raven orchid
#

Though I know mac doesn’t support sparse and I think ue5 runs there right? Maybe they’re using software sparse

delicate rain
#

They have Nanite they don't need real sparse

cold sky
#

and AMD has sparse since then

raven orchid
#

Oh they ditched amd

cold sky
#

tbf Nanite is the point at which I'm like... fuck rasterizing

raven orchid
#

At least on apple silicon it reports no sparse in the vulkan support viewer

delicate rain
#

Also I'm pretty sure you still need screen space shadows and other stuff because otherwise VSMs just die when you have stuff like moving grass and swaying trees

cold sky
wicked notch
#

you need SSS regardless of those because of LOD KEKW

raven orchid
#

Yeah I hope so

#

I think metal might be approaching feature parity?

cold sky
wicked notch
#

ye, iirc nanite uses LOD for their VSMs too, so when there are mismatching LODs between views, they run a screen space trace to fix the shadows

cold sky
frank sail
#

Lod transitions are basically unnoticeable in VSM unless the lod bias is high

delicate rain
#

Oh you just match the lod to the clipmap level

#

Ez clap

wicked notch
#

but then rip LOD

cold sky
#

just #ray-tracing

#

stop it

#

get some help

delicate rain
#

Yeeah

cold sky
#

build yourself a BLAS

wicked notch
#

build a blas

#

rayQueryEXT

#

life is good

delicate rain
#

It was fun while it lasted but I'm starting to be a doubter

raven orchid
#

I think ue5 will be all in vsm

#

But they’ll probably go rt for ue6

cold sky
#

in the time you've spent fucking around re-implementing a SW Rasterizer or optimizing for the bottleneck of Sparse Bind, you could have made a BLAS builder

delicate rain
#

They have rt as the highest quality settings

cold sky
#

HW render will just die on multiple lights

raven orchid
#

Yeah like they say you can use it with non nanite but

#

They don’t recommend it froge_bleak

cold sky
#

there's only so many renderpasses/subpasses you can churn through in a frame

#

and its not like you can have a separate viewport per layer

raven orchid
#

I wonder what

#

I wonder how a hybrid solution would look

cold sky
#

I guess you could do a layered render, if you kept all VSMs same size

raven orchid
#

Rt for non nanite, vsm for nanite

wicked notch
#

unreal tells us that we can have as many lights as we want because all VSMs are 16k and their culling is 100% effective

cold sky
#

I always recommend #ray-tracing

wicked notch
#

as in, inactive pages do not contribute to any cost

cold sky
wicked notch
#

They only call Nanite::Rasterize once, it rasterizes all viewports

cold sky
raven orchid
cold sky
#

a bit less in HW (viewports need to be binned)

raven orchid
#

I actually don’t have any non potatos to test with

cold sky
cold sky
raven orchid
#

Rip to my setup froge_sad

cold sky
#

also each light requires that you analyze the pixels on screen and vote on which pages should be active

#

so by that virtue alone, there's a limit on lights

#

and I think it might be in the low tens per pixel (counted by area/volume of effect, not visibility/contribution)

raven orchid
#

Hmmm I wonder if they publish that limit anywhere

cold sky
raven orchid
#

Idk how many virtual lights they allow

cold sky
raven orchid
#

Yeah I meant like

wicked notch
raven orchid
#

I wonder if they have a “best practices” section for point lights

cold sky
#

when FPS drops to 5, there's your limit

wicked notch
#

you could do conservative culling to improve perf

#

as in, cull only the N most important meshlets this frame

cold sky
#

and you gib a fixed budget (of rays)

raven orchid
#

Checked their docs

#

They currently don’t support per-light resolution controls and it looks like they don’t yet expose a page update budget control

wicked notch
#

ye they only do 16k

raven orchid
#

But they’re both listed as “we’re working on it” so maybe it’ll get a bit better soon for non directional

cold sky
#

non directional will completely wreck your page occupancy

#

persp projection blows up anything near the light source

#

and suddenly all your pages are active bleakekw

#

VSM works with directional lights cause the projection is ortho

#

actual lights in the scene there really isn't a question of "are there any empty areas in our shadowmap"

#

more like "what LoD should we page them at"

frank sail
raven orchid
#

The virtual mip chain you mean?

cold sky
cold sky
#

@wicked notch I have a silly way to optimize your HiZ

wicked notch
#

I'm all ears

cold sky
#

well not HiZ but early Z

#

do you need to draw your meshlets for different mip levels separately ?

wicked notch
#

yes

#

I have 16 calls to vkCmdDrawMeshTasksEXT

#

16 is num_clipmaps

cold sky
#

for the higher (low res) mip pages, don't draw geometry (and cull) in parts that are overlapped by higher res resident mips

wicked notch
#

I could theoretically squash them into one

#

But then I hit the max number of meshlets that I can rasterize with that func

#

Which is 1 mil for some reason

cold sky
#

you can downsample the higher pages into the lower pages

#

a resident high level page covers 1/4 of the page immediately below it (whether the lower one is resident or not)

#

hmm I guess you can improve both HiZ and earlyZ 🤣

#

basically only rasterize geo for resident 1/4-pages that don't have a resident page directly beneath them in the mip-chain

#

that way you won't be drawing any meshlets to the highest mips

#

at all

#

btw you don't need to interleave compute/FS, you can do the downsampling all at the end after everything has been rendered
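
One way to read the quarter-page idea, assuming a plain mip chain where one coarse page covers a 2x2 block of pages in the next finer level: for each coarse page, only the quadrants without a resident finer page get rasterized, and the others are filled by downsampling afterwards. `mip_residency_t` and `quadrants_to_rasterize` are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct mip_residency_t {
    std::uint32_t pages_per_row;
    std::vector<bool> resident;

    bool at(std::uint32_t x, std::uint32_t y) const {
        const std::size_t index = static_cast<std::size_t>(y) * pages_per_row + x;
        assert(index < resident.size());
        return resident[index];
    }
};

// Returns a 4-bit mask for coarse page (x, y): bit (qy * 2 + qx) is set when the
// matching finer page is NOT resident, so that quadrant needs real geometry.
inline std::uint32_t quadrants_to_rasterize(const mip_residency_t& finer,
                                            std::uint32_t x, std::uint32_t y) {
    std::uint32_t mask = 0;
    for (std::uint32_t qy = 0; qy < 2; ++qy) {
        for (std::uint32_t qx = 0; qx < 2; ++qx) {
            if (!finer.at(2 * x + qx, 2 * y + qy)) {
                mask |= 1u << (qy * 2 + qx);
            }
        }
    }
    return mask;
}
```

A mask of zero means the coarse page can be produced entirely by downsampling, so no meshlets need to be drawn into it.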

#

clever, eh?

wicked notch
#

hm

cold sky
#

this can probably cut your Mesh/Vertex processing time by 16x

#

nvm, still a good gain

delicate rain
#

I suggested something similar way back, but not with the rasterize 1/4 that were not resident, I wanted to do downsample when you switch clip and the higher clip is fully covered by the lower clips

#

I'm not sure it will bring that big of an improvement though

wicked notch
#

worth trying

wicked notch
#

I have implemented deferred binding updates as well as batching

#

However, there's a very sad fact

#

Updating a couple dozen pages takes 5ms

frank sail
#

let it go m8

wicked notch
#

no

#

more tricks are to be done

#

I really wanna do caching

#

but my brain is too small

#

Jaker can I trouble you to explain how to snap the camera to page offsets once again

#

also how to mark pages dirty

#

🥺

frank sail
wicked notch
#

ye

frank sail
#

sounds not worth

wicked notch
#

I'm fully convinced that caching will let HW sparse shine

#

I just don't really understand how to do caching

frank sail
#

I set this bit in the allocator

wicked notch
#

What if a page remains allocated for two frames in a row but the camera snaps to another offset

frank sail
#

nothing happens to it

#

because of my epic wrap addressing
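
The wrap addressing idea can be sketched like this, assuming the page table entry is chosen from the page's absolute coordinate modulo the table size. `wrap`, `page_coord_t` and `table_entry` are illustrative names, not the ones from VirtualShadowMaps.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Modulo that stays in [0, size) for negative inputs too.
inline std::int32_t wrap(std::int32_t value, std::int32_t size) {
    assert(size > 0);
    const std::int32_t r = value % size;
    return r < 0 ? r + size : r;
}

struct page_coord_t {
    std::int32_t x, y;
};

// `local` is the page coordinate relative to the clipmap window, `origin` is the
// clipmap's page offset. Their sum is the absolute page, which is then wrapped.
inline page_coord_t table_entry(page_coord_t local, page_coord_t origin, std::int32_t table_size) {
    return {wrap(local.x + origin.x, table_size), wrap(local.y + origin.y, table_size)};
}
```

When the origin moves by one page, a page that stays inside the window changes its local coordinate by one in the opposite direction, so it maps to the same table entry and nothing has to be moved or invalidated.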

frank sail
wicked notch
frank sail
#

the beauty is that there is no "decision" as in an if statement

#

but ye lemme show u

#

do you have VirtualShadowMaps.cpp open already

#

the function in question is DirectionalVirtualShadowMap::UpdateOffset

#

I can explain each line

wicked notch
#

Alright, please do KEKW

#

I feel very dumb

frank sail
#

I'll post it

#

this will be the reference I guess

  void DirectionalVirtualShadowMap::UpdateOffset(glm::vec3 worldOffset)
  {
    for (uint32_t i = 0; i < uniforms_.numClipmaps; i++)
    {
      // Find the offset from the un-translated view matrix
      uniforms_.clipmapStableViewProjections[i] = stableProjections[i] * stableViewMatrix;
      const auto clip = stableProjections[i] * stableViewMatrix * glm::vec4(worldOffset, 1);
      const auto ndc = clip / clip.w;
      const auto uv = glm::vec2(ndc) * 0.5f; // Don't add the 0.5, since we want the center to be 0
      const auto pageOffset = glm::ivec2(uv * glm::vec2(context_.pageTables_.Extent().width, context_.pageTables_.Extent().height));
      uniforms_.clipmapOrigins[i] = pageOffset;

      const auto ndcShift = 2.0f * glm::vec2((float)pageOffset.x / context_.pageTables_.Extent().width, (float)pageOffset.y / context_.pageTables_.Extent().height);
      
      // Shift rendering projection matrix by opposite of page offset in clip space, then apply *only* that shift to the view matrix
      const auto shiftedProjection = glm::translate(glm::mat4(1), glm::vec3(-ndcShift, 0)) * stableProjections[i];
      viewMatrices[i] = glm::inverse(stableProjections[i]) * shiftedProjection * stableViewMatrix;
    }

    uniformBuffer_.UpdateData(uniforms_);
  }
#

so first off, we are calculating a separate offset for each clipmap (since each one has a different page size)

#

the offset we want needs to be a multiple of the page size for a given clipmap

#

this line calculates the viewproj of the clipmap as if it were locked to the origin (it still has a rotation component)

uniforms_.clipmapStableViewProjections[i] = stableProjections[i] * stableViewMatrix;
wicked notch
#

How does a stable projection differ from a non stable one

frank sail
#

at one point I was offsetting the projection matrix rather than the view matrix, but that fucked up math for other stuff later on, so I removed it

#

now I just explicitly call them stable so I'm certain about what I'm looking at

frank sail
#

it provides a reference point (coordinate space?) I guess

#

oh, btw, worldOffset is the position of the player camera. we are trying to center the clipmap on it

wicked notch
#

hm

frank sail
#

these three lines are transforming the player coord to [-0.5, 0.5] space (half NDC space?) of the stable clipmap viewproj we just made

      const auto clip = stableProjections[i] * stableViewMatrix * glm::vec4(worldOffset, 1);
      const auto ndc = clip / clip.w;
      const auto uv = glm::vec2(ndc) * 0.5f; // Don't add the 0.5, since we want the center to be 0
#

note that it's perfectly fine for the resulting coord to not actually be in [-0.5, 0.5]. that just means it's not within the frustum of the stable viewproj (which is highly likely for the smaller clipmaps)

#

to get the all-important offset, we just multiply that "uv" by the number of pages in the clipmap

const auto pageOffset = glm::ivec2(uv * glm::vec2(context_.pageTables_.Extent().width, context_.pageTables_.Extent().height));
#

(btw it might be better to round than to truncate, idk)

#

anywho, this offset tells us how many page widths we need to translate the clipmap camera to be approximately centered on the player
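
Stripped of glm, the steps so far boil down to the two small functions below. This is a sketch with made-up types (`vec2_t`, `ivec2_t`) that takes the camera position already projected to NDC by the stable viewproj. It truncates toward zero like the `glm::ivec2` conversion in the original, and the NDC shift matches the `ndcShift` line of the posted function.

```cpp
#include <cassert>
#include <cstdint>

struct vec2_t {
    float x, y;
};

struct ivec2_t {
    std::int32_t x, y;
};

// NDC in [-1, 1] -> "uv" in [-0.5, 0.5] -> whole pages (truncated toward zero).
inline ivec2_t page_offset_from_ndc(vec2_t ndc, ivec2_t table_extent) {
    assert(table_extent.x > 0 && table_extent.y > 0);
    const vec2_t uv{ndc.x * 0.5f, ndc.y * 0.5f};
    return {static_cast<std::int32_t>(uv.x * static_cast<float>(table_extent.x)),
            static_cast<std::int32_t>(uv.y * static_cast<float>(table_extent.y))};
}

// Whole pages -> NDC shift; the factor 2 converts the uv-sized value back to NDC.
inline vec2_t ndc_shift_from_page_offset(ivec2_t page_offset, ivec2_t table_extent) {
    assert(table_extent.x > 0 && table_extent.y > 0);
    return {2.0f * static_cast<float>(page_offset.x) / static_cast<float>(table_extent.x),
            2.0f * static_cast<float>(page_offset.y) / static_cast<float>(table_extent.y)};
}
```

Because the offset is a whole number of pages, the resulting shift is always a multiple of one page in NDC, which is what keeps cached pages aligned as the camera moves.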

delicate rain
#

since -1 -> 0 <- 1

frank sail
#

at worst you'll be off by one page which probably isn't noticeable ever

delicate rain
#

it can be noticeable for the higher clipmaps

#

but yeah

frank sail
#

what I mean is that the camera won't be perfectly centered on the player

#

it's not like the shadow will be wrong

delicate rain
#

ah I see what you mean

#

yeah

frank sail
#

that's why it's probably impossible to notice unless you are somehow looking at every page in the clipmap

#

anyways

#

back to the explanation

#

the projection matrix allows us to conveniently apply a shift in NDC space by translating it

      const auto ndcShift = 2.0f * glm::vec2((float)pageOffset.x / context_.pageTables_.Extent().width, (float)pageOffset.y / context_.pageTables_.Extent().height);
      const auto shiftedProjection = glm::translate(glm::mat4(1), glm::vec3(-ndcShift, 0)) * stableProjections[i];
#

we are only interested in translating it on XY because we don't want depth to get fucked when the player moves

#

so we're basically sliding the bad boys on a plane

frank sail
wicked notch
#

one quicc question

#

the mul by 2 is to go back to ndc?

frank sail
#

pageOffset / numberOfPages generates a UV-space value

#

but projections work in NDC (actually clip space but yolo) I guess

#

so ye u are correct

wicked notch
#

if it worky it worky

frank sail
#

here is the final line in the loop

viewMatrices[i] = glm::inverse(stableProjections[i]) * shiftedProjection * stableViewMatrix;
#

ok I think that line is stupid

#

I mean it works

wicked notch
#

Yeah I dunno what the hell is going on here

frank sail
#

I'm basically extracting the shift from the shiftedProjection (undoing all the projection parts) and then applying it to the view matrix

#

so it's just translating the view matrix bleakekw

frank sail
#

except it's view space instead of NDC, which is trivial to convert to

wicked notch
#

so basically

#

you make a shifted projection using a stable projection

frank sail
#

the last three lines could probably be replaced by viewMatrices[i] = glm::translate(stableViewMatrix, glm::vec3(pageOffset * frustumSize, 0));

wicked notch
#

then you undo the "projection" part of "shifted projection" by multiplying with its inverse

#

and that's your translated view matrix

frank sail
#

ye

#

it's kinda dumb though as I just noted
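
For an orthographic projection the `inverse(P) * shiftedP` product reduces to a pure translation in view space, so the shift can be computed directly. The sketch below assumes the x axis of the projection is `x_ndc = scale * x_view + offset` with `scale = 2 / frustum_width`; the function names are made up, and the sign follows from the `-ndcShift` in the posted function.

```cpp
#include <cassert>
#include <cstdint>

// P^-1 * T(-s) * P maps x_view to x_view - s / scale for an ortho projection.
inline float view_shift_from_ndc(float ndc_shift, float ortho_scale) {
    assert(ortho_scale != 0.0f);
    return -ndc_shift / ortho_scale;
}

// Same shift expressed in pages: each page covers frustum_width / pages_per_row
// view-space units, and the clipmap moves by a whole number of pages.
inline float view_shift_from_pages(std::int32_t page_offset, float frustum_width,
                                   std::int32_t pages_per_row) {
    assert(pages_per_row > 0);
    return -static_cast<float>(page_offset) *
           (frustum_width / static_cast<float>(pages_per_row));
}
```

Both forms give the same number, which matches the "pageOffset * frustumSize" shortcut up to sign if `frustumSize` means the size of one page. One caveat on that shortcut: `glm::translate(m, v)` computes `m * T(v)`, so the translation is applied before the view matrix. The original line pre-multiplies instead, and the closer equivalent would be `glm::translate(glm::mat4(1), shift) * stableViewMatrix`.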