#Iris - A Journey through OpenGL and beyond to learn Graphics
oh you mean latency in the hzb?
It probably won't matter much anyways
my man is speedrunning this
(it's a reference to Bon Jovi's Livin' on a Prayer song, in case someone doesn't catch that)
I thought it was a reference to another song but ye
```glsl
vec4 project_screen_aabb(in aabb_t aabb, in mat4 transform, in mat4 proj_view) {
    const vec3[] corners = vec3[](
        vec3(aabb.min.x, aabb.min.y, aabb.min.z),
        vec3(aabb.min.x, aabb.min.y, aabb.max.z),
        vec3(aabb.min.x, aabb.max.y, aabb.min.z),
        vec3(aabb.min.x, aabb.max.y, aabb.max.z),
        vec3(aabb.max.x, aabb.min.y, aabb.min.z),
        vec3(aabb.max.x, aabb.min.y, aabb.max.z),
        vec3(aabb.max.x, aabb.max.y, aabb.min.z),
        vec3(aabb.max.x, aabb.max.y, aabb.max.z)
    );
    vec2 min_xy = vec2(1.0);
    vec2 max_xy = vec2(0.0);
    for (uint i = 0; i < 8; ++i) {
        const vec4 clip = proj_view * transform * vec4(corners[i], 1.0);
        const vec2 ndc = clamp(clip.xy / clip.w, -1.0, 1.0);
        const vec2 uv = ndc * vec2(0.5, -0.5) + 0.5;
        min_xy = min(min_xy, uv);
        max_xy = max(max_xy, uv);
    }
    return vec4(min_xy, max_xy);
}

bool is_meshlet_visible(in vec4 box_uvs) {
    const vec2 min_xy = box_uvs.xy;
    const vec2 max_xy = box_uvs.zw;
    const vec2 hzb_size = vec2(imageSize(u_vsm_hzb));
    const float max_mip = 1 + floor(log2(max(hzb_size.x, hzb_size.y)));
    const float width = (max_xy.x - min_xy.x) * hzb_size.x;
    const float height = (max_xy.y - min_xy.y) * hzb_size.y;
    const float mip = clamp(ceil(log2(max(width, height))), 0, max_mip);
    const bvec4 samples = bvec4(
        bool(texelFetch(u_vsm_hzb, ivec2(box_uvs.xy * hzb_size), int(mip)).x),
        bool(texelFetch(u_vsm_hzb, ivec2(box_uvs.zy * hzb_size), int(mip)).x),
        bool(texelFetch(u_vsm_hzb, ivec2(box_uvs.xw * hzb_size), int(mip)).x),
        bool(texelFetch(u_vsm_hzb, ivec2(box_uvs.zw * hzb_size), int(mip)).x)
    );
    return any(samples);
}
```
this be kinda weird
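For reference, the mip pick in `is_meshlet_visible` can be sketched on the CPU like this. It is only a sketch: `select_hzb_mip` is a made-up name, and unlike the snippet above it clamps to the last valid mip index (mip count minus one) and guards against `log2(0)` for degenerate boxes, both of which are my assumptions about what is intended.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: picks the mip at which a box of
// width_texels x height_texels (measured at mip 0) spans at most
// two texels per axis, so four corner fetches cover it.
inline int select_hzb_mip(float width_texels, float height_texels,
                          int hzb_width, int hzb_height) {
    assert(hzb_width > 0 && hzb_height > 0);

    // Last valid mip index = floor(log2(largest dimension)), done in integers.
    const int largest = std::max(hzb_width, hzb_height);
    int max_mip = 0;
    while ((largest >> (max_mip + 1)) > 0) {
        ++max_mip;
    }

    // Assumption: treat boxes smaller than one texel as one texel wide,
    // which avoids log2(0) = -inf.
    const float extent = std::max(std::max(width_texels, height_texels), 1.0f);
    const int mip = static_cast<int>(std::ceil(std::log2(extent)));
    return std::min(std::max(mip, 0), max_mip);
}
```

A 5x3 texel box lands on mip 3 (8x8 texels per mip texel), and anything larger than the whole 128x128 pyramid clamps to mip 7.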
583 device lost and counting
BORN TO TDR / GPU IS A FUCK / Crash Em All 1989 / I am trash man / 410,757,864,530 LOST DEVICES
lol
so first culling attempt
still takes a fuckton of time to rasterize
the regular raster path takes 47 microseconds (without culling lol), VSM raster takes 1m (with culling)
And the VSM hardware raster shader is sus
But I don't see anything extremely slow with it, except the likely divergence lol
How loading gl_FragCoord causes a short scoreboard dependency is beyond me
download RGP
there you go AMD man
I guess gl_Position is stored in four SGPRs which have almost no latency to load
so uh I guess that won't cause a stall on AMD
idk what NV is doing, but it's probably similarish (they don't have SGPRs, but gl_Position should come from some low-latency memory somewhere)
so how fix
ok actually that s_load_b128 is loading gl_Position into four SGPRs from memory somewhere (s[0:1] is two SGPRs holding an address)
epic
caching time then
"The best rasterization technique is no rasterization"
~Sun Tzu
another sus thing is the PROP util
how is PROP the top sol when it handles early-late Z/depth testing and blending
all of which are disabled in the VSM pipeline 
I'm so conchfused rn 
if it makes you feel any better, my hzb is broken and idk why
Btw since turing nvidia has uniform registers, which are nearly identical to sgprs on amd afaik
they are also integer only
the sgpr operations can not operate on floating point
well that's all registers
i guess they can load whatever
yea thats the same on amd
i had to learn that the hard way at work
adding one mul made the vgpr use explode by 12
🥸 sometimes compiler optimizations bite my ass when they suddenly get turned off by some bingus
Scalar ALU (SALU) instructions operate on values that are common to all work-items in the wave. These
operations consist of 32-bit integer or float arithmetic, and 32- or 64-bit bit-wise operations. The SALU also can
perform operations directly on the Program Counter, allowing the program to create a call stack in SGPRs.
Many operations also set the Scalar Condition Code bit (SCC) to indicate the result of a comparison, a carry-out,
or whether the instruction result was zero.
i can assure you amd can not do floating point ops with the scalar alu
where is this from?
what makes you say that?
i believe i searched for it in the isa before
how are they named
i cant find any s_XXX_f32 instructions
I cant find a single scalar f32 instruction 😦
I guess my snippet is
because neither can I
Scalar ALU (SALU) instructions operate on values that are common to all work-items in the wave. These
operations consist of 32-bit integer or float arithmetic, and 32- or 64-bit bit-wise operations. The SALU also can
This is really specific on saying float arithmetic works huuuuuuuuh
i guess its just very unfortunate wording
the registers can be used for VALU ops but ye
mhmm
but there are rumors that rdna 4 gets float support on the salu
that would be sick
conchfusion levels are reaching heights I didn't think were possible
@frank sail if you disable caching in your thing, what does the GPU trace look like
ye can you show the trace
oops one sec
got distracted with manmade horrors in #questions
hold up imma delete that pic
better crop + relevant pass is selected so the counters are accurate
but this is completely bogus
I gotta schleep but I'll talk l9er
Ye btw I was testing on this
Which I guess has low geometry complexity but whatever
I wonder if low geometry density is the Achilles' heel of this since there will be too much overdraw
Not overdraw but rather fs invocations
baffling to say the least
is it mesh shaders?
somehow mesh shaders are garbage for this?
I dunno
let me try something
Btw I hypothesize that reducing the VSM resolution could improve perf as we'll have far fewer fragments for geometry hanging off the edge of the visible bounds
As we've learned, only a small fraction of the VSM is in use at a time which makes it a viable change
Did u make sure to select that pass so we're looking at the right counters
If so, then those are spooky numbers, Mason
Are you thinking just halve the virtual res?
The other option would bring back readback
I’m using 8k personally
I'll let my brain machine work
you go schlepp jaker, you don't have to suffer with me 
Hold on I just had an unhinged idea
Are 8192^2 textures a thing that we can make
We could make an 8 bit stencil texture and use that for early-stencil
that is so unhinged I have no idea how that works 💀
Oh dang that brings back week 1 vsm memories
We would have to populate the texture to indicate where dirty pages are
But the beauty of it is that you just need one total, as long as you're okay with
```
foreach view:
    CullMeshlets(view);
    Draw(view);
```
why is a 8192 texture required tho
Now it's a question of whether the early-s will actually save any perf
Because we can't make a 16k^2 texture
I severely doubt that
Hmm
My potato gpu let me
I'm bottlenecked by PS invocations so it'd be nice to skip a bunch of them
Not as good as viewport culling, but it's perhaps something
Can you resize a viewport using the gpu
At some point I'd guess that filling out the bigass texture will take a lot of time
No 😢
Crap
Device generated commands
For each draw in a sequence, the following can be specified:
a different shader group
a number of vertex buffer bindings
a different index buffer, with an optional dynamic offset and index type
a number of different push constants
a flag that encodes the primitive winding
Rip formatting
Hmmm
So this issue
Is it that too many meshlets are spilling into non dirty or non resident pages so there is wasted work?
Setting the front face from the gpu seems useless
Ye that's the theory
It should only be bad for large meshlets, however
We both use 64/64 ultra small meshlets sir
I mean physically lol
Hmmm
Low geometry density is the problem
We need nanite 1 tri/px
Oh yeah, that's probably why it works so well for them
damn the small indie company Epic Games really did think it through
Other solutions
- per triangle culling
- per triangle clipping against min/max
- sw rast
So we just need to take a small detour and implement full nanite
Then our vsms are done
well that was in the todo list for me 
I "just" need to figure out graph partitioning
Also others:
- readback to modify viewport but 1-3 frames behind
- only render 1/4 of each clip per frame
Wait so what kind of speed up did hzb give?
My ultra naive shrimple HZB did a pretty good job at removing wasted mesh shader invocations
but well
it's still ultra bad
It's currently broken so I can't say
But preliminary results say: beeg
Other option is like
Btw another solution
- increase the size of the smallest VSMs so meshlets aren't big compared to them
How tiny do we need the base clip map
Oh yeah that’s where I was going right now lol
My confession is I never wanted vsm quality shadows
I just wanted csm but with sparse caching
heresy
Have you heard of parallax corrected cached shadows
Tldr solution for the sun rotation with cached shadows
You can still have it rotate in huge quantized steps, but you do a ray marching hack to find where the shadow "would be" if the sun was at another angle
Does that basically let you reuse old cached data?
So you can get the appearance of smooth rotation while rendering the shadows very infrequently
Oh wow that actually sounds badly needed
Idk how it works with dynamic geometry unfortunately
I think far cry 5 (the game that uses them iirc) only did the parallax thing for their long distance adaptive shadows
They did regular csm up to like 30m
I still think vsm has the potential to replace csm fully
caching + good culling has potential
but caching + good culling + 1 tri/px should be best
you just need a GPU capable of rasterizing SCREEN_RESOLUTION triangles N times 
Caching + good culling + viewport resize + temporal shadow update budget + parallax corrected + stencil rejection (maybe???)
extremely tedious
And yeah if you just go full nanite that’ll be the best
but possible 
Oh ye the other part is that nanite can make these be 1px in light space
I'm working with 1 resolution of triangles
Tf
This is how shadow map enjoyers cope
yes I'm a SHADOWMAP enjoyer
S help
H orror
A re
D neverending
O this is a legitimate call for help
W
M
A
P
Also I should mention I did this but
Shadow quality is still way better than csm for the same perf
So despite the nightmare fuel I think vsm is still worth it
So basically treating VSM as a nuclear bomb that makes CSM problems disappear
Tfw no 30 clipmaps at 16k res 😔
Consistent across all scenes
Actually I can do more than 4 and perf stays the same
Probably 10 would be doable
Literally 0? Because idk how that'd work
Well I still do normal vector offsetting during sampling
O
But slope biasing I removed
Yeah that one
There is a formula to calculate the exact bias needed but I couldn't get it to work so I just added more constants 
I'll get to it later lol
Along with proper filtering/10k lines of SMRT
Anyways I gotta sleep fr
hmm
not sure if this is of any interest to you my frog
The Mini But Mighty Fantasy Unreal Engine Environments is a one week only tiny bundle of very high quality fantasy environments. While it's for Unreal Engine, as you can see in the video it's trivial to get the assets into the Godot Engine and even Blender. (Unreal has amazing built-in exporting capabilities)
Links
https://www.humblebundle.c...
Oh the castle looks awesome
Lmao that would have been insane
when you watch it now, its quite underwhelming somehow
Really? Those are some standards you have mister
it was stunning back then
ok so
I have been stumped on the great PROP incident for the past week
I have zero clue as to what to do besides implement caching
but that just sounds like a bandaid to some ultra weird underlying problem
You can always pause this and begin caching and hzb
More like h"z"b since there isn't a z
yeah I figure that's all I can do
but caching still scares me
whatever Saky did was terrifying 
Ah
I think you can do something shrimpler, like what I did
(which still took me weeks to implement because smooth brain)
ye perchance
first step is getting the stable addresses
So you can get sharp shadows everywhere, but while keeping the page addresses the same
Second step is making sure pages are marked correctly (dirty pages should only be the ones that were just alloc'd this frame, pages that were already visible and continue to be visible should be untouched)
Third step is making the pyramid of dirty pages and culling against it like an hzb
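A rough CPU-side sketch of that third step, assuming one dirty flag per page and a power-of-two page grid. `PagePyramid`, `build_page_pyramid` and `rect_touches_dirty` are made-up names and this is not the code from either renderer; the real thing would run in compute shaders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pyramid of dirty-page flags: level 0 has one flag per page,
// each coarser level ORs a 2x2 block of the level below.
struct PagePyramid {
    std::vector<std::vector<std::uint8_t>> levels;
    int size = 0; // pages per row at level 0, must be a power of two
};

inline PagePyramid build_page_pyramid(const std::vector<std::uint8_t>& dirty, int size) {
    assert(size > 0 && (size & (size - 1)) == 0);
    assert(dirty.size() == static_cast<std::size_t>(size) * static_cast<std::size_t>(size));
    PagePyramid pyramid;
    pyramid.size = size;
    pyramid.levels.push_back(dirty);
    for (int dim = size / 2; dim >= 1; dim /= 2) {
        const std::vector<std::uint8_t>& prev = pyramid.levels.back();
        const int prev_dim = dim * 2;
        std::vector<std::uint8_t> next(static_cast<std::size_t>(dim) * dim, 0);
        for (int y = 0; y < dim; ++y) {
            for (int x = 0; x < dim; ++x) {
                const int px = x * 2;
                const int py = y * 2;
                next[y * dim + x] = static_cast<std::uint8_t>(
                    prev[py * prev_dim + px] | prev[py * prev_dim + px + 1] |
                    prev[(py + 1) * prev_dim + px] | prev[(py + 1) * prev_dim + px + 1]);
            }
        }
        pyramid.levels.push_back(next);
    }
    return pyramid;
}

// Conservative test: true if the inclusive page rect may touch a dirty page.
// Picks the level where the rect spans at most 2 texels per axis, then
// checks the four corners, the same trick as an HZB lookup.
inline bool rect_touches_dirty(const PagePyramid& p, int x0, int y0, int x1, int y1) {
    assert(0 <= x0 && x0 <= x1 && x1 < p.size);
    assert(0 <= y0 && y0 <= y1 && y1 < p.size);
    const int extent = std::max(x1 - x0, y1 - y0) + 1;
    int level = 0;
    while ((1 << level) < extent) {
        ++level;
    }
    const int dim = p.size >> level;
    const std::vector<std::uint8_t>& texels = p.levels[level];
    const int lx0 = x0 >> level, ly0 = y0 >> level;
    const int lx1 = x1 >> level, ly1 = y1 >> level;
    return (texels[ly0 * dim + lx0] | texels[ly0 * dim + lx1] |
            texels[ly1 * dim + lx0] | texels[ly1 * dim + lx1]) != 0;
}
```

The test is conservative: a rect that straddles coarse texels can report true even when none of its own pages are dirty, which only costs some wasted rasterization and never drops a meshlet that is needed.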
It took me like a week but it's literally 10-20 loc
That's part of step 1
It hurts my fingers to write how it works on mobile
This fn can be cheaply called every frame
https://github.com/JuanDiegoMontoya/Frogfood/blob/526f7a6645207abbcb05a26a1f778b3ba05580d2/src/techniques/VirtualShadowMaps.cpp#L349
(If there is no rotation)
Basically this function computes
- A view matrix that is snapped to the grid to use for rendering the shadow map
- An offset to apply in the fragment shader of rendering the shadow map which "corrects" the page address. This is needed because the fragment shader only knows about window space (gl_FragCoord)
Every other lookup into the VSM uses the stable viewproj, which is placed at the origin, and fract(uv)
Wait which incident was this?
Idk what it stands for
We need to give this a special name
Virtual occlusion culling, voc
Idk
page overdraw gulling (pog)
virtual undesirables kulling (vuk)
Maybe just hierarchical page culling (hpc)
But it isn't culling pages 
"Hierarchical page buffer" can't be misinterpreted probably
As long as it has "hierarchical" in the name tbh
This is the great PROP incident btw
rendering VSMs for me takes a huge amount of time, the profiler just tells me "idk what it is but the PROP unit is dying"
16k viewport yeah
FYI mine just renders 4k now (but 16k should be okay after the new culling)
It doesn't affect correctness at all except at very high resolutions
I'm not at computer to test
Before culling, it worked fine but with worse perf
Now I suspect the perf hit is much smaller
do you have the thing 🅱️ushed
Yes, but there is still a bug
A 1 liner if you want to pull
Oh wait I didn't push that bug
Hmm hold on, lemme boot and push (might be a few)
smh imagine not pushing bugs
Well I committed it, just haven't pushed 
btw
but since ndc coords can go beyond [-1;1] this is potentially not the actual min isn't it?
perhaps a fract is needed?
Oh
Yeah there is no actual min and max, you understood right
They should be infinity and negative infinity to start I guess
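A minimal sketch of that initialization, with made-up `Vec2`/`Bounds2` types: start min at +infinity and max at -infinity so the first point always replaces both, no matter what range the coordinates live in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

struct Vec2 { float x, y; };          // hypothetical stand-in for GLSL vec2
struct Bounds2 { Vec2 min, max; };

inline Bounds2 compute_bounds(const Vec2* points, std::size_t count) {
    assert(points != nullptr && count > 0);
    const float inf = std::numeric_limits<float>::infinity();
    // min starts at +inf and max at -inf, so no assumption is made
    // about the points staying inside [0, 1] or [-1, 1].
    Bounds2 bounds = { { inf, inf }, { -inf, -inf } };
    for (std::size_t i = 0; i < count; ++i) {
        bounds.min.x = std::min(bounds.min.x, points[i].x);
        bounds.min.y = std::min(bounds.min.y, points[i].y);
        bounds.max.x = std::max(bounds.max.x, points[i].x);
        bounds.max.y = std::max(bounds.max.y, points[i].y);
    }
    return bounds;
}
```

With the points (2, 3) and (5, -1), starting from min = 1 and max = 0 would report min.x = 1, while the infinity version reports the actual 2.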
Fract isn't needed though
I use a repeat sampler in this case
Though fract could be used instead
I project the camera position by the sun view projection matrix (sun is at 0,0). Then I divide the ndc position by ndc page size and take the ceil of that and that's it you got your offset
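Sketched in C++, under a few assumptions of mine: the sun projection is orthographic (so w = 1 and the projected xy is already NDC), NDC spans [-1, 1] so one page is `2 / pages_per_row` wide, and `compute_page_offset` is a made-up name. The input is the camera position after the sun view-projection transform.

```cpp
#include <cassert>
#include <cmath>

struct PageOffset { int x, y; };

// ndc_x / ndc_y: camera position projected by the sun view-projection
// matrix (sun at the origin). Returns the offset in whole pages.
inline PageOffset compute_page_offset(float ndc_x, float ndc_y, int pages_per_row) {
    assert(pages_per_row > 0);
    // Assumption: NDC covers [-1, 1], i.e. 2 units across pages_per_row pages.
    const float ndc_page_size = 2.0f / static_cast<float>(pages_per_row);
    PageOffset offset;
    offset.x = static_cast<int>(std::ceil(ndc_x / ndc_page_size));
    offset.y = static_cast<int>(std::ceil(ndc_y / ndc_page_size));
    return offset;
}
```

With 128 pages per row a page is 1/64 NDC units wide, so a camera at NDC (0.1, -0.1) gives an offset of (7, -6).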
That is if you want to do the shrimple Jaker view
Not sure if I tagged the correct message 😅
guys I think I'm going insane
gl_FragCoord is in window space right?
That means if I am rasterizing a 16384^2 viewport, gl_FragCoord should be in range [0;16384) right?
Yeah
ok
Apparently I had a massive bug
but the thing just worked because some genius decided that 16384 / 128 = 128
Lmao I'm always scared of changing the numbers I have in my defines
Because it's 50% chance of breaking every piece of math that was previously aligned
ye that's exactly what happened
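The chat doesn't say what the bug actually was. One plausible way for it to hide: at 16384 with 128-texel pages, pages-per-row is also 128, so using the page size where the page count belongs gives the right answer by accident. A sketch that derives the count instead of hardcoding it (all names here are made up):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t pages_per_row(std::uint32_t virtual_resolution,
                                      std::uint32_t page_size) {
    return virtual_resolution / page_size;
}

// At 16k the page count per row coincides with the page size...
static_assert(pages_per_row(16384, 128) == 128, "same number as the page size");
// ...but at 8k it does not, so mixing the two up breaks immediately.
static_assert(pages_per_row(8192, 128) == 64, "differs from the page size");

// Linear page index of the page that contains a given texel.
inline std::uint32_t page_index(std::uint32_t texel_x, std::uint32_t texel_y,
                                std::uint32_t virtual_resolution,
                                std::uint32_t page_size) {
    assert(page_size > 0 && virtual_resolution % page_size == 0);
    assert(texel_x < virtual_resolution && texel_y < virtual_resolution);
    const std::uint32_t row = pages_per_row(virtual_resolution, page_size);
    return (texel_y / page_size) * row + (texel_x / page_size);
}
```

Keeping every dependent value derived from the two base defines means changing the resolution cannot silently desync the math.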
as it turns out
rendering twelve 16k shadow maps is hard
and as it turns out again
halving the resolution unbound me from the PROP
Nsight crashed my computer
thank you nvidia
```cpp
.capacity = (1 << 24) >> 4, // 2^24 meshlets, packed
```
I could've just written 1 << 20
but this is funnier
: D
SHRTS should be SHITS
obligatory culling failure
I am very much not confident in this:
```glsl
bool project_screen_aabb(in aabb_t aabb, in mat4 transform, in mat4 proj_view, out vec4 box_uvs) {
    const vec3[] corners = vec3[](
        vec3(aabb.min.x, aabb.min.y, aabb.min.z),
        vec3(aabb.min.x, aabb.min.y, aabb.max.z),
        vec3(aabb.min.x, aabb.max.y, aabb.min.z),
        vec3(aabb.min.x, aabb.max.y, aabb.max.z),
        vec3(aabb.max.x, aabb.min.y, aabb.min.z),
        vec3(aabb.max.x, aabb.min.y, aabb.max.z),
        vec3(aabb.max.x, aabb.max.y, aabb.min.z),
        vec3(aabb.max.x, aabb.max.y, aabb.max.z)
    );
    vec2 min_xy = vec2(1.0);
    vec2 max_xy = vec2(0.0);
    for (uint i = 0; i < 8; ++i) {
        const vec4 clip = proj_view * transform * vec4(corners[i], 1.0);
        if (clip.w <= 0.0) {
            return false;
        }
        const vec2 ndc = fract(clip.xy / clip.w);
        const vec2 uv = ndc * vec2(0.5, -0.5) + 0.5;
        min_xy = min(min_xy, uv);
        max_xy = max(max_xy, uv);
    }
    box_uvs = vec4(min_xy, max_xy);
    return true;
}
```
specifically the const vec2 ndc = fract(clip.xy / clip.w); part
hmm would it help to visualize the values?
they're kinda difficult to visualize
but perchance
actually they may not be difficult at all
maybe with a little debug switch in there so that you can toggle that on and off from outside
when I was having culling trouble I rendered out debug quads of the depth buffer linearized
so I could see what was there
(debug quads of the projected sphere)
what da views doin
ight caching time
I be wondering
if before caching
I should reimplement my nanite
just to see if it actually benefits
we do a little detouring
I have all the code ready, I just have to mash it together
and make it work (optionally)
LVSTRI procrastinating caching I'm procrastinating culling - I like the dynamic
uh methinks you should do the naniteisms later
ye perchance
man brain fog is real
me: where the fuck is my phone
also me: has phone in my fucking hands with the flashlight on
early onset dementia fr
lol happened to me recently as well, was under the desk to fiddle cables in the darkness 😄
#youarenotalone
i have a question
how long did it take to implement lod 0 sw rasterizer in compute for virtual geom?
It has left me feeling shaken
shaken belief in a culling solution?
devsh mentioned interpolateAtOffset which is viable
actually you'd have to write to gl_FragDepth which would kill early z on most hw anyway
meh
actually
yeah just render some quads hehe
wait why would you write to gl_FragDepth
you don't
o
I was trippin fr
we good then?
page sized point?
uh hold on lemme think
I guess a quad actually makes sense if it needs to cover more than one sample
Rects would be better but only NV supports those 
Quads will suffice
hmm do we need to cover more than one shrimple
I'm approaching this the same way as ROC
Yeah unless you want to shade 128² pixels in your fs 
but hold on
how do I get the quad's depth?
unless I render the projected meshlet's AABB?
I was thinking it would be 1 for active pages and 0 for inactive pages
so early-z would just cull fragments in inactive pages
Hmm that would be reasonable
when u actually render the geometry, the regular depth will be tested against that
it doesn't, not anymore
rip msaa
devsh mentioned that it's only for when depth has higher samples than color which is the opposite of what he proposed
according to the spec
ah rip
but we can use interpolateAtOffset anyways so MSAA always sucked
so we declaring gl_FragCoord varying?
is that real syntax
maychance
in vec4 gl_FragCoord; or something
at the very least, we can declare our own window-space vs output
also, don't we need conservative raster for this to work
if each fs invocation is to shade multiple pixels
you know what
Imma draw a fullscreen tringle
and check whether I should output a depth of 1 or 0 in the frag shader 
it's either that or clear+quad per active page
I think the latter would be faster in cases where there are few active pages
but there is still this issue for the drawing of the vsm
hmm I don't really have a real grasp on how conservative raster works
The only time I read about it was for HFTS
Nvidia's bullshit technique for shadows 
it means a fragment is generated if the triangle touches a texel rather than if it covers the center
it can also mean a fragment is generated if the triangle covers the entire texel, depending on the mode
at least AMD and NV support the ext in vulkan
but in GL, only NV supports it 
time for triangle dilation geometry shader 
High Fructose Torn Syrup
you also use it for voxelizing, iirc devsh mentioned the reason you'd want it in that case over msaa voxelization
I just forgot
imagine, in the extreme case, that you want only one fs invocation for each page
so you only need to fill a 128x128 depth buffer to get early z
a triangle could cover a significant amount of a page without actually covering the center sample, but obviously you care about all the texels in the page being written
but if the triangle doesn't cover the center sample, nothing gets written unless you use conservative raster
realistically, you'd only do like 4x4 or 8x8 pixels in the fs so the error wouldn't be so large, but it would still be less than ideal
wait hold on
why do you wanna manually dilate though, with NV at least the conservative rasterization ext is available since maxwell afaik
I thought we were using a depth buffer as big as the VSM but 16 bit?
(which still blows me out but what are you gonna do)
yeah that's the fallback
for gpus without it lol (also that was a meme suggestion)
I don't think anything older than maxwell should be legally allowed to run VSM
16-bit depth is ass though
we only store a 1 or a 0 it's fine
why not use a stencil buffer
almost forgot that we actually have two depth buffers 
is there early stencil?
yes
same
I don't even know how you actually render into one 
I think you set the read/write ops in a similar way to configuring depth ops
they just work differently
glStencilMask similar to glDepthMaskisms?
and glColorMask(false, false, false, false)
probably
```c
typedef struct VkStencilOpState {
    VkStencilOp failOp;
    VkStencilOp passOp;
    VkStencilOp depthFailOp;
    VkCompareOp compareOp;
    uint32_t compareMask;
    uint32_t writeMask;
    uint32_t reference;
} VkStencilOpState;
```
wtf is this
btw this all only helps the case where every page isn't being drawn to, i.e., when culling is also working
to be fair, my HZB is extremely iffy sometimes 
and it's definitely more conservative than necessary
there is a learnopengl tutorial for stencil 
back to our origins huh
when im back from memory transfers it better work
filling a big chungus stencil buffer for every draw sounds cap
also you literally can't even make the stencil buffer for some of the resolutions we target
ye true
if only we could do MSAA
can you use a ubo/ssbo in a smol compute pass in between instead?
unless you do the fake MSAA thing with interpolateAtOffset (which requires conservative raster)
you want to write something to the stencil buffer
can this be replaced with a ubo/ssbo instead
ah thats also a thing
yeah the goal is to not emit fragment shader invocations somehow
forgot that shit wraps so this is a bad solution half the time 
well maybe there's a way to change the planes in that case
basically there will either be a large void in the middle that you don't want to render to, or a small square in the middle that you do want to render to
for posterity
idk how to efficiently compute the former bounds tbh. you can't do a simple min/max on the page coordinates
you could just coerce AMD to make sparse not garbage on windows
and all our problems would vanish
use deadly force if necessary 
guess I'll try culling individual triangles against the page hierarchy
then sw rasterize 
btw I fixed the HZB
```glsl
vec4 project_screen_aabb(in aabb_t aabb, in mat4 transform, in mat4 proj_view) {
    const vec3[] corners = vec3[](
        vec3(aabb.min.x, aabb.min.y, aabb.min.z),
        vec3(aabb.min.x, aabb.min.y, aabb.max.z),
        vec3(aabb.min.x, aabb.max.y, aabb.min.z),
        vec3(aabb.min.x, aabb.max.y, aabb.max.z),
        vec3(aabb.max.x, aabb.min.y, aabb.min.z),
        vec3(aabb.max.x, aabb.min.y, aabb.max.z),
        vec3(aabb.max.x, aabb.max.y, aabb.min.z),
        vec3(aabb.max.x, aabb.max.y, aabb.max.z)
    );
    vec2 min_uv = vec2(+3.402823466e+38);
    vec2[8] semi_uv;
    for (uint i = 0; i < 8; ++i) {
        const vec4 clip = proj_view * transform * vec4(corners[i], 1.0);
        const vec2 uv = (clip.xy / clip.w) * vec2(0.5, -0.5);
        min_uv = min(min_uv, uv);
        semi_uv[i] = uv;
    }
    vec2 min_xy = vec2(0.0);
    vec2 max_xy = vec2(1.0);
    for (uint i = 0; i < 8; ++i) {
        const vec2 uv = fract(semi_uv[i] + min_uv);
        min_xy = min(min_xy, uv);
        max_xy = max(max_xy, uv);
    }
    return vec4(min_xy, max_xy);
}
```
for posterity too
is this for your boolean hzb?
ye
also, how did you fix biasing?
I am still only applying a mere constant offset 
there's a linalg version below
I impl'd this but it's fucked somehow and I didn't care enough to fix it
btw you also need to account for fp error (somehow)
I think I just added a constant 1.0 / (1 << 24) or something
you can still try it and see what happens 😛
its because MSAA is a grid, just like using a higher resolution render target, so when you voxelize you can still miss a sample and have seams/gaps in your voxelization
sir @wispy spear can I share the sparse experiment in #experiments
There's a possibility that your system will lockup if you are on NV/Windows so maybe a warning should be necessary
of course, i dont think you need permission : )
https://github.com/NVIDIAGameWorks/Displacement-MicroMap-Toolkit would this be of any help in your meshletisms?
That's for their ray tracing extensions
Opacity and displacement micromap generation and stuff
ah so something completely different
i saw meshopt showing up in there and thought there was some meshopt-isms going on few weeks or months ago
Yeah though I get how it invokes the idea of meshlets when it says "micromeshes"
ye meshleading
I have faith
With the new batching & waiting method vkQueueBindSparse does NOT cause hitches
when updating lots of clipmaps
vkQueueBindSparse's perf still sucks though
but it's not as bad
I wonder
Maybe setting a limit of updated pages per frame won't cause a lot of pop-in if I can keep framerate up?
Nsight's sparse image viewer is pretty pog though ngl
wdym?
vkQueueBindSparse + vkQueueWaitIdle 
It is entirely possible that my previous setup with the timeline semaphore was wrong
I also used the graphics queue for everything, maybe using the transfer or the compute queue + timeline semaphore could give me more perf
well duh
I just gave up very quickly, this time I'll put in more work to make vkQueueBindSparse not suck and report back
Thanks to sharpneli & co. I now have a lot more info
I'd think its kinda self-evident that since the bindsparse happens on the GPU timeline its better to not wait for the queue to be idle XD
Ye, previously I was worried about the CPU overhead, but if you saw in the experiments room, if you don't spam vkQueueBindSparse calls the CPU overhead is basically zero or just wait for idle
shouldn't it be basically one queue bind sparse per frame?
Yes but even then, if you call vkQueueBindSparse once per frame, if enough frames pass you have a big issue
because the HW queue fills up
you have a big issue from frequent flipping of a page on/off
use temporal smoothing / fixed mem budget
yeah I have some ideas about that
one would be to wait for a page to be unused for K frames before evicting
I'll first try deferred page unbinding, if there's enough memory in the pool
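A small sketch of the "wait K frames before evicting" idea, assuming a per-page idle counter that resets whenever the page is requested again. `PageState` and `update_residency` are made-up names, and the actual bind/unbind calls are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PageState {
    bool resident = false;
    std::uint32_t idle_frames = 0; // consecutive frames without a request
};

// Call once per frame. `requested[i]` is nonzero if page i was marked
// visible this frame. Returns the pages whose memory can be released now.
inline std::vector<std::size_t> update_residency(std::vector<PageState>& pages,
                                                 const std::vector<std::uint8_t>& requested,
                                                 std::uint32_t evict_after_k_frames) {
    assert(pages.size() == requested.size());
    std::vector<std::size_t> evicted;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        PageState& page = pages[i];
        if (requested[i]) {
            page.resident = true;   // bind here if it was not resident
            page.idle_frames = 0;   // any request resets the countdown
        } else if (page.resident) {
            if (++page.idle_frames >= evict_after_k_frames) {
                page.resident = false;
                page.idle_frames = 0;
                evicted.push_back(i);
            }
        }
    }
    return evicted;
}
```

A page that flips on and off every other frame never reaches the threshold with K = 2, so it stays bound and produces no bind traffic.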
ye
if you can spare it, always have the pages eat up all the mem (that you've set aside)
btw this strategy is nice because for every bind, you have a matching unbind
basically to page something in, you have to page something out
Hmm yeah that does sound good
also lets me never actually "unbind" (as in, submit a VkSparseImageMemoryBind with a null memory handle)
You can just replace the page with a new offset, saving one bind operation
btw you can score all pages (Cause there's so few of them) on the GPU and run a GPU compute sort
(or CPU radix)
and if your mem budget is N pages, then you grab N most important ones
figure out which ones need to be paged in the most
btw even if a page is not needed, you shouldn't set its importance to 0
you can do something like 0-1 is how likely it is to come into view, and >1 is for visible pages
then out of the N, you figure out how many of those are not resident and grab the top K out of that if K is your "bind per frame" limit
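The selection described above, sketched on the CPU with a plain sort. The scoring function itself is not shown, and `PageCandidate` and `select_pages_to_bind` are made-up names; on the GPU this would be the compute sort mentioned earlier.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct PageCandidate {
    std::size_t index; // page id
    float score;       // 0-1: likelihood of coming into view, >1: visible
    bool resident;     // already backed by memory
};

// Takes the N highest-scoring pages (the memory budget), then returns up to
// K of those that are not resident yet, most important first.
inline std::vector<std::size_t> select_pages_to_bind(std::vector<PageCandidate> pages,
                                                     std::size_t budget_n,
                                                     std::size_t binds_per_frame_k) {
    assert(budget_n > 0);
    std::stable_sort(pages.begin(), pages.end(),
                     [](const PageCandidate& a, const PageCandidate& b) {
                         return a.score > b.score;
                     });
    const std::size_t wanted = std::min(budget_n, pages.size());
    std::vector<std::size_t> to_bind;
    for (std::size_t i = 0; i < wanted && to_bind.size() < binds_per_frame_k; ++i) {
        if (!pages[i].resident) {
            to_bind.push_back(pages[i].index);
        }
    }
    return to_bind;
}
```

Pages that fall outside the top N are the natural eviction candidates, which pairs each bind with an unbind as described earlier.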
nice that's a great heuristic
figuring out the most important pages is gonna be a chore though
perhaps I can score them using the 1 / clipmap_level and 1 / distance_to_camera heuristics
I have 256 and it's fine
alright I got hardware VSM back in full functionality
now with 100% more popin
I have reclaimed earlyZ and hardware filtering though
That one frame of latency is so sad
Look at ZROP go though 
ZROP being abused in unspeakable ways is always fun
This is with or without culling?
Without
16 clips?
16 clips ye
Wtf that's pretty good
It is
It's at least twice as fast as software VSM
if not more
depending on position etc
The horrible thing is the frame of latency
That is just so bad
I didn't really investigate real sparse. What is the process you do now / how does it work?
(if you don't mind going over it ofc 🥹)
Sure thing
So first things first, the marking of visible pages remains the same
Now though, instead of sending it to another compute shader to allocate pages, we read it back on the CPU
Specifically I do this
```cpp
auto bindings = std::vector<ir::sparse_image_memory_bind_t>();
bindings.reserve(requests.size());
for (auto page = 0_u64; page < requests.size(); ++page) {
    const auto offset = ir::offset_3d_t {
        .x = static_cast<int32>((page % IRIS_VSM_VIRTUAL_PAGE_ROW_SIZE) * IRIS_VSM_VIRTUAL_PAGE_SIZE),
        .y = static_cast<int32>((page / IRIS_VSM_VIRTUAL_PAGE_ROW_SIZE) * IRIS_VSM_VIRTUAL_PAGE_SIZE),
    };
    const auto extent = ir::extent_3d_t {
        .width = IRIS_VSM_VIRTUAL_PAGE_SIZE,
        .height = IRIS_VSM_VIRTUAL_PAGE_SIZE,
        .depth = 1,
    };
    if (is_requested && !is_allocated) {
        const auto entry = _allocator.get().allocate();
        bindings.emplace_back(ir::sparse_image_memory_bind_t {
            .offset = offset,
            .extent = extent,
            .buffer = _buffer->slice(memory_offset, IRIS_VSM_PHYSICAL_PAGE_RESOLUTION, false),
        });
        _pages[page] = entry;
        _allocated++;
    } else if (!is_requested && is_allocated) {
        bindings.emplace_back(ir::sparse_image_memory_bind_t {
            .offset = offset,
            .extent = extent,
            .buffer = _buffer->slice(memory_offset, IRIS_VSM_PHYSICAL_PAGE_RESOLUTION, true),
        });
        _allocator.get().deallocate(entry);
    }
}
return bindings;
```
This takes in requests which is the read-back array
and outputs an array of sparse_image_memory_bind_t
Which is equivalent to VkSparseImageMemoryBind
It basically takes in the offset and extent of our virtual image and assigns it to a certain offset of a VkDeviceMemory
The next step is to send this info over to vkQueueBindSparse which is just this
```cpp
for (auto i = 0_u32; i < IRIS_VSM_CLIPMAP_COUNT; ++i) {
    const auto requests = vsm_visible_pages_buffer.as_span();
    // std::vector<ir::sparse_image_memory_bind_t>
    // non-const so the std::move below actually moves instead of copying
    auto bindings = _vsm.images[i].make_updated_sparse_bindings(requests.subspan(
        i * IRIS_VSM_VIRTUAL_PAGE_COUNT,
        IRIS_VSM_VIRTUAL_PAGE_COUNT
    ));
    if (!bindings.empty()) {
        _sparse_bind_info.image_binds.emplace_back(ir::sparse_image_memory_bind_info_t {
            .image = std::cref(_vsm.images[i].image()),
            .bindings = std::move(bindings),
        });
    }
}
```
And then it's just a good ol' vk call
```cpp
sparse_timeline_value = _sparse_bind_semaphore->increment(2);
_sparse_bind_info.wait_semaphores = {
    { std::cref(*_sparse_bind_semaphore), {}, sparse_timeline_value }
};
_sparse_bind_info.signal_semaphores = {
    { std::cref(*_sparse_bind_semaphore), {}, sparse_timeline_value + 1 }
};
_device->compute_queue().bind_sparse(_sparse_bind_info);
```
I use timeline semaphores such that:
- if there were no previous sparse binds, simply signal to the graphics queue once it is done
- If there was a previous sparse bind, wait for it to be done and signal the graphics queue
Finally, sampling is super easy
```glsl
float sun_shadow = 0.0;
const int vsm_residency = sparseTextureLodARB(u_vsm[virtual_page.position.z], virtual_page.uv.xyz, 0, sun_shadow);
if (!sparseTexelsResidentARB(vsm_residency)) {
    sun_shadow = 1.0;
}
```
Hmmm so the latency is only due to the readback?
Yep
I can't even do anything about it because I need to update the sparse bindings through vkQueueBindSparse
Right okay correct me if I'm wrong but the popping comes because you try to read sparse memory at a clip level where nothing is located since the requesting logic is the same as sampling logic. Aka it tries to read clip 3 while the memory is still bound to clip 4?
Due to the single frame lag
It tries to read clip 3 but the memory is unbound
Right now I immediately unbind all pages that are not requested in the current frame
so if in the previous frame some page wasn't allocated, but the current frame says "now it is allocated boi", I sample an unbound page
The only way to fix it is to have no frames in flight (and stall before sampling) 
Hmmmm
I know how to fix the popping due to clip level switch
You just delay the clip level you sample by a single frame
But disocclusion and edge of frame will still pop
Okay I have a cursed idea
What if we tried to combine fake sparse and hw sparse
how so
We use hw sparse for pages that are only switching lod levels - since you can just delay the sampling as I described. But for pages that cover pixels that were previously not visible at all we do fake sparse with no delay
Then once these pages will switch lod we move them to the hw sparse path

In your vsm page table you'd need to store info if a page is real or hw sparse to know what to sample
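Something like this could work for the page table entry. Purely a sketch: the bit layout, the enum and the helper names are invented here, nobody in the thread has specified them.

```cpp
#include <cassert>
#include <cstdint>

// Which path a page should be sampled through (hypothetical naming).
enum class page_backing_t : std::uint32_t {
    hw_sparse = 0, // page lives in the hardware sparse image
    software = 1,  // page lives in the "fake sparse" physical atlas
};

// Top bit selects the backing, the remaining 31 bits hold a physical page index.
constexpr std::uint32_t backing_bit = 1u << 31;

inline std::uint32_t pack_entry(page_backing_t backing, std::uint32_t physical_index) {
    assert(physical_index < backing_bit);
    return (backing == page_backing_t::software ? backing_bit : 0u) | physical_index;
}

inline page_backing_t entry_backing(std::uint32_t entry) {
    return (entry & backing_bit) != 0u ? page_backing_t::software : page_backing_t::hw_sparse;
}

inline std::uint32_t entry_index(std::uint32_t entry) {
    return entry & ~backing_bit;
}
```

The shader would then branch on the backing bit to pick between the sparse image and the software atlas.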
Ah but this would double the amount of clipmaps we have to draw
Double the clipmaps but with efficient culling that's not a problem 
The culling is also the same for non resident pages so we could do two step culling - first share cull for both real and fake and second cull again now only for pages of real and fake separately
But I feel like this is extremely cursed
Let me try getting rid of the timeline semaphore and just stalling the device
because that's what good people do, they stall the device
Ok I got a bsod
very good
but the textures look like shit, some washed out BDU pants
This debug view is so good
In white are the active pages relative to what the shadow map sees
i'm super interested in the final perf
adding hpb really helped mine
would you say
your current sparse approach is getting close to nvidia's GL driver impl of sparse?
Nah
Mine is very rudimentary right now
It's just "update what changed and semaphore"
No deferred or staggered updates or stuff like that
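For reference, "update what changed" boils down to a diff between last frame's residency and this frame's requests. Minimal sketch below, assuming one byte per page in both arrays. `diff_pages` and `binding_update_t` are illustrative names, not what `make_updated_sparse_bindings` actually does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct binding_update_t {
    std::vector<std::uint32_t> to_bind;   // pages requested now but not resident before
    std::vector<std::uint32_t> to_unbind; // pages resident before but no longer requested
};

// `resident` is the state left by the previous frame and is updated in place,
// `requested` is this frame's readback (non-zero means the page is wanted).
inline binding_update_t diff_pages(std::vector<std::uint8_t>& resident,
                                   const std::vector<std::uint8_t>& requested) {
    assert(resident.size() == requested.size());
    binding_update_t update;
    for (std::size_t i = 0; i < resident.size(); ++i) {
        const bool was_resident = resident[i] != 0;
        const bool is_requested = requested[i] != 0;
        if (is_requested && !was_resident) {
            update.to_bind.push_back(static_cast<std::uint32_t>(i));
        } else if (!is_requested && was_resident) {
            update.to_unbind.push_back(static_cast<std::uint32_t>(i));
        }
        resident[i] = requested[i];
    }
    return update;
}
```

Unbinding immediately like this is exactly what makes a page pop when it gets requested again one frame later.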
I hate readback with a passion though 
let's hope for this
save us from the readback
gpu driven sparse management
absolute garbage
I wish 
while we're at it, why not make it also perform well out of the box
ah rasterization, the paradigm where you heroically toil to overcome problems unknown in raytracing
this scares me tbh
like from your experiment it sounds like it might be possible, but also seems very tricky to get the API to not obliterate performance
It's tricky but completely doable
Right now I have the most naive implementation possible, and vkQueueBindSparse takes at most 10ms
When updating across all 16 clipmaps
Also, viewing clip 2 in nsight makes me have a device loss error?
mfw
The fact that you have to have no fif makes it really bad though
I'm not sure there is a way to overcome that
yeh but your shadows are scuffed
you need 1 frame in flight and issue a complete queue stall
you need to do both to not have scuffed shadows
graphics queue stall?
ye
```cpp
while (is_running()) {
    mark_pages();
    readback();
    wait_idle();
    update_bindings();
    render_shadows();
}
```
so no frame in flight?
no frame in flight
nono, you can omit the wait idle, you just have to work with an older readback then
the wait idle is there to complete the transfer from device to host
thats fine tho
waiting for transfer?
delayed readback
ah yeah
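Sketch of what the delayed readback bookkeeping could look like with a ring of readback buffers. The frames-in-flight count and every name are assumptions, and it presumes the host has already waited on the fence of frame `frame - frames_in_flight` before recording `frame`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr std::uint32_t frames_in_flight = 2; // hypothetical

// Readback buffer the GPU writes its page requests to during `frame`.
inline std::uint32_t slot_of(std::uint64_t frame) {
    return static_cast<std::uint32_t>(frame % frames_in_flight);
}

// Newest frame whose requests are guaranteed complete on the host while
// recording `frame`, or nothing during the first few frames.
inline std::optional<std::uint64_t> readable_frame(std::uint64_t frame) {
    if (frame < frames_in_flight) {
        return std::nullopt;
    }
    return frame - frames_in_flight;
}

// Buffer to read on the host while recording `frame`.
inline std::uint32_t readable_slot(std::uint64_t frame) {
    const auto source = readable_frame(frame);
    assert(source.has_value());
    return slot_of(*source);
}
```

The readable slot is the same buffer the current frame will overwrite, so the host has to consume it before submitting. The price is that bindings lag the camera by `frames_in_flight` frames instead of one.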
you get this though
We delay the user input for FIF and then the readback will be fine
just show a loading screen
or some loading.gif on a quad
starfield did it
cod did it
why not showcase_vsm.exe
we have cool shadows but you cannot move or rotate camera or else loading screen
I am so tempted to actually put a stall and see what happens 
father forgive me
no more scuffed shadows!
don't mind the giant hole though
Totally not issuing a full stall 💀
so good
The pop in would probably be less bad if you cached pages that weren't just on screen
there'd still be popping when you see new pages so meh
I already have a full stall in my code rn so 
by writing wait_idle() I have forsaken my humanity Jaker
well actually most of the demos of fucked up pages are fixable:
- fast camera turning, second time you turn pages are there
- use sampler feedback (implement your own) and if page not resident, sample a higher mip-map which is
the annoying thing to compensate for are disocclusions
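The fallback part of the second bullet, as a CPU-side sketch. It assumes you already know, per mip, whether the page covering the sample is resident (that would come from your own feedback/residency table). `first_resident_mip` is a made-up helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// resident[mip] says whether the page covering the sample point is resident
// at that mip. Returns the first resident mip at or above `wanted_mip`
// (coarser), or -1 when nothing in the chain is resident.
inline int first_resident_mip(const std::vector<bool>& resident, int wanted_mip) {
    assert(wanted_mip >= 0);
    for (std::size_t mip = static_cast<std::size_t>(wanted_mip); mip < resident.size(); ++mip) {
        if (resident[mip]) {
            return static_cast<int>(mip);
        }
    }
    return -1;
}
```

A blurrier shadow from a coarser mip for a frame or two is usually a lot less visible than a hole.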
run a small blur/average over your pages to score them ? So that way a not-right-now needed page feels important cause it has resident neighbours
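One way to read that suggestion: an unnormalised 3x3 box sum over the residency grid, so a page with lots of resident neighbours gets a high score. Sketch only, with the grid layout (row-major, one byte per page) and the function name assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns, for every page, how many pages in its 3x3 neighbourhood
// (itself included, out-of-bounds ignored) are currently resident.
inline std::vector<int> neighbour_scores(const std::vector<std::uint8_t>& resident,
                                         int width, int height) {
    assert(width > 0 && height > 0);
    assert(resident.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<int> scores(resident.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx;
                    const int ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) {
                        continue;
                    }
                    sum += resident[static_cast<std::size_t>(ny * width + nx)] != 0 ? 1 : 0;
                }
            }
            scores[static_cast<std::size_t>(y * width + x)] = sum;
        }
    }
    return scores;
}
```

A non-resident page with a score of 8 is a good candidate to keep or prefetch, since a disocclusion is likely to land on it next.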
Or just use a GPU allocator for zero frames of disocclusion lag 
yeah but then no HiZ
and HW rasterizer neatly dropping writes to non-resident pages
yeah but you actually rarely have to draw vsm pages so it's not a big deal
moving the camera causes a lot of cache invalidations though and perf dips quite a bit
that's why you don't move the camera
unless the sun moves 😛
if the sun moves it's joever
also you know, if dynamic occluders within the pages move too XD
Nuh uh
I think unreal is trying to move to a dual vsm mem pool solution to handle dynamics
Then they just merge dynamic with static cache
Ye
but then your sparse bind takes 2ce as long
from what I can see your biggest bottleneck is the 1 frame lag + how many pages you can bind per frame
not the actual drawing, which is hilarious
Though I know mac doesn’t support sparse and I think ue5 runs there right? Maybe they’re using software sparse
They have Nanite they don't need real sparse
huh? Mac used to be AMD Radeon since like 2010-ish ?
and AMD has sparse since then
Oh they ditched amd
tbf Nanite is the point at which I'm like... fuck rasterizing
At least on apple silicon it reports no sparse in the vulkan support viewer
Also I'm pretty sure you still need screen space shadows and other stuff because otherwise VSMs just die when you have stuff like moving grass and swaying trees
ofc we all know that but in order to compete M1 and M2 GPUs would have to step into AMD's shoes
you need SSS regardless of those because of LOD 
Because of LOD?
lod transitions as you move
ye, iirc nanite uses LOD for their VSMs too, so when there are mismatching LODs between views, they run a screen space trace to fix the shadows

Lod transitions are basically unnoticeable in VSM unless the lod bias is high
but then rip LOD
Yeeah
build yourself a BLAS
It was fun while it lasted but I'm starting to be a doubter
in the time you've spent fucking around re-implementing a SW Rasterizer or optimizing for the bottleneck of Sparse Bind, you could have made a BLAS builder
They have rt as the highest quality settings
probably the SW+Nanite kind then
HW render will just die on multiple lights
there's only so many renderpasses/subpasses you can churn through in a frame
and its not like you can have a separate viewport per layer
I guess you could do a layered render, if you kept all VSMs same size
Rt for non nanite, vsm for nanite
unreal tells us that we can have as many lights as we want because all VSMs are 16k and their culling is 100% effective
I always recommend #ray-tracing
as in, inactive pages do not contribute to any cost
do not do what UE was doing last year, do what UE will be doing next year
They only call Nanite::Rasterize once, it rasterizes all viewports
yeah its possible to do in SW
I would but idk how potato gpus will like this
a bit less in HW (viewports need to be binned)
I actually don’t have any non potatos to test with
don't write for today's potatos, write for tomorrows RTX x090 Ti
yeah light will make a certain amount of pages active
Rip to my setup 
also each light requires that you analyze the pixels on screen and vote on which pages should be active
so by that virtue alone, there's a limit on lights
and I think it might be in the low tens per pixel (counted by area/volume of effect, not visibility/contribution)
Hmmm I wonder if they publish that limit anywhere
but they have a cost when you're meshlet culling
Idk how many virtual lights they allow
its empirical
Yeah I meant like
ye that's the only issue
I wonder if they have a “best practices” section for point lights
when FPS drops to 5, there's your limit
you could do conservative culling to improve perf
as in, cull only the N most important meshlets this frame
there's no point in having a global limit, as not every light is in-frustum and not every light affects (or could affect) the same number of pixels
hey guess what, its called importance sampling
and you gib a fixed budget (of rays)
Checked their docs
They currently don’t support per-light resolution controls and it looks like they don’t yet expose a page update budget control
ye they only do 16k
So this is probably being generous
But they’re both listed as “we’re working on it” so maybe it’ll get a bit better soon for non directional
non directional will completely wreck your page occupancy
persp projection blows up anything near the light source
and suddenly all your pages are active 
VSM works with directional lights cause the projection is ortho
actual lights in the scene there really isn't a question of "are there any empty areas in our shadowmap"
more like "what LoD should we page them at"
With proper mipmap usage, that won't require you to have stupid texel density
The virtual mip chain you mean?
like you can easily get into a situation where what the camera sees covers 80%+ of the shadowmap
more like "what LoD should we page them at"
thats what I'm getting at here
@wicked notch I have a silly way to optimize your HiZ
I'm all ears
well not HiZ but early Z
do you need to draw your meshlets for different mip levels separately ?
for the higher (low res) mip pages, don't draw geometry (and cull) in parts that are overlapped by higher res resident mips
I could theoretically squash them into one
But then I hit the max number of meshlets that I can rasterize with that func
Which is 1 mil for some reason
you can downsample the higher pages into the lower pages
a resident high level page covers 1/4 of the page immediately below it (whether the lower one is resident or not)
hmm I guess you can improve both HiZ and earlyZ 🤣
basically only rasterize geo for resident 1/4-pages that don't have a resident page directly beneath them in the mip-chain
that way you won't be drawing any meshlets to the highest mips
at all
btw you don't need to interleave compute/FS, you can do the downsampling all at the end after everything has been rendered
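My reading of the idea, sketched on the CPU: for each coarse page, only the quadrants with no resident finer page on top of them need geometry, and the rest get filled by downsampling afterwards. The "coarse/fine" naming, the row-major residency layout and `quadrants_to_rasterize` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// `resident_fine` is the row-major residency of the next finer mip, which is
// `fine_width` pages wide (twice the coarse width). Coarse page (cx, cy)
// covers fine pages (2cx + dx, 2cy + dy) for dx, dy in {0, 1}.
// Returns a 4-bit mask where bit (dy * 2 + dx) is set when that quadrant has
// no resident finer page and therefore needs to be rasterized.
inline unsigned quadrants_to_rasterize(const std::vector<std::uint8_t>& resident_fine,
                                       int fine_width, int cx, int cy) {
    assert(fine_width > 0 && fine_width % 2 == 0);
    assert(cx >= 0 && cy >= 0 && 2 * cx + 1 < fine_width);
    unsigned mask = 0;
    for (int dy = 0; dy < 2; ++dy) {
        for (int dx = 0; dx < 2; ++dx) {
            const std::size_t index =
                static_cast<std::size_t>((2 * cy + dy) * fine_width + (2 * cx + dx));
            assert(index < resident_fine.size());
            if (resident_fine[index] == 0) {
                mask |= 1u << (dy * 2 + dx);
            }
        }
    }
    return mask;
}
```

A mask of 0 means the whole coarse page can be produced purely by downsampling, so no meshlets need to be drawn to it at all.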
clever, eh?
hm
I suggested something similar way back, but not with the rasterize 1/4 that were not resident, I wanted to do downsample when you switch clip and the higher clip is fully covered by the lower clips
I'm not sure it will bring that big of an improvement though
worth trying
I have implemented deferred binding updates as well as batching
However, there's a very sad fact
Updating a couple dozen pages takes 5ms
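For anyone following along, deferring with a per-frame budget could look roughly like this. It is not the Iris implementation, just a generic sketch, and the queue type, budget and `take_batch` name are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Pops at most `budget` pending page binds (oldest first) to submit this
// frame; whatever is left stays queued for the following frames.
inline std::vector<std::uint32_t> take_batch(std::deque<std::uint32_t>& pending,
                                             std::size_t budget) {
    assert(budget > 0);
    std::vector<std::uint32_t> batch;
    while (!pending.empty() && batch.size() < budget) {
        batch.push_back(pending.front());
        pending.pop_front();
    }
    return batch;
}
```

This caps the cost of a single bind call but stretches the time until a page shows up, so it trades a spike for more pop-in.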
let it go m8
no
more tricks are to be done
I really wanna do caching
but my brain is too small
Jaker can I trouble you to explain how to snap the camera to page offsets once again
also how to mark pages dirty
🥺
I mean the API sparse lol
ye
sounds not worth
I'm fully convinced that caching will let HW sparse shine
I just don't really understand how to do caching

if page was allocated and hasn't been rendered to, it is dirty
I set this bit in the allocator
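That rule as a tiny state sketch: dirty is set at allocation and cleared once the page has been rendered. Field and function names are made up, and it deliberately doesn't cover invalidation from moving occluders or clipmap snapping.

```cpp
#include <cassert>

struct page_state_t {
    bool allocated = false;
    bool dirty = false; // allocated but not rendered to yet
};

// Called by the allocator when a page is requested.
inline void allocate_page(page_state_t& page) {
    if (!page.allocated) {
        page.allocated = true;
        page.dirty = true; // fresh memory, contents are garbage until drawn
    }
}

// Called after the shadow pass has drawn into the page.
inline void mark_rendered(page_state_t& page) {
    assert(page.allocated);
    page.dirty = false;
}

// Called when the page is no longer requested.
inline void free_page(page_state_t& page) {
    page.allocated = false;
    page.dirty = false;
}
```

A page that stays allocated across frames keeps `dirty == false`, which is what lets the shadow pass skip it.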
What if a page remains allocated for two frames in a row but the camera snaps to another offset
what part do u need explain't
I don't see how tbqh
the snapping, how do you decide when to move the origin of the clipmaps, how do you calculate that
the beauty is that there is no "decision" as in an if statement
but ye lemme show u
do you have VirtualShadowMaps.cpp open already
the function in question is DirectionalVirtualShadowMap::UpdateOffset
I can explain each line
I'll post it
this will be the reference I guess
void DirectionalVirtualShadowMap::UpdateOffset(glm::vec3 worldOffset)
{
for (uint32_t i = 0; i < uniforms_.numClipmaps; i++)
{
// Find the offset from the un-translated view matrix
uniforms_.clipmapStableViewProjections[i] = stableProjections[i] * stableViewMatrix;
const auto clip = stableProjections[i] * stableViewMatrix * glm::vec4(worldOffset, 1);
const auto ndc = clip / clip.w;
const auto uv = glm::vec2(ndc) * 0.5f; // Don't add the 0.5, since we want the center to be 0
const auto pageOffset = glm::ivec2(uv * glm::vec2(context_.pageTables_.Extent().width, context_.pageTables_.Extent().height));
uniforms_.clipmapOrigins[i] = pageOffset;
const auto ndcShift = 2.0f * glm::vec2((float)pageOffset.x / context_.pageTables_.Extent().width, (float)pageOffset.y / context_.pageTables_.Extent().height);
// Shift rendering projection matrix by opposite of page offset in clip space, then apply *only* that shift to the view matrix
const auto shiftedProjection = glm::translate(glm::mat4(1), glm::vec3(-ndcShift, 0)) * stableProjections[i];
viewMatrices[i] = glm::inverse(stableProjections[i]) * shiftedProjection * stableViewMatrix;
}
uniformBuffer_.UpdateData(uniforms_);
}
so first off, we are calculating a separate offset for each clipmap (since each one has a different page size)
the offset we want needs to be a multiple of the page size for a given clipmap
this line calculates the viewproj of the clipmap as if it were locked to the origin (it still has a rotation component)
uniforms_.clipmapStableViewProjections[i] = stableProjections[i] * stableViewMatrix;
How does a stable projection differ from a non stable one
at one point I was offsetting the projection matrix rather than the view matrix, but that fucked up math for other stuff later on, so I removed it
now I just explicitly call them stable so I'm certain about what I'm looking at
the page offset will be computed with this matrix btw
it provides a reference point (coordinate space?) I guess
oh, btw, worldOffset is the position of the player camera. we are trying to center the clipmap on it
hm
these three lines are transforming the player coord to [-0.5, 0.5] space (half NDC space?) of the stable clipmap viewproj we just made
const auto clip = stableProjections[i] * stableViewMatrix * glm::vec4(worldOffset, 1);
const auto ndc = clip / clip.w;
const auto uv = glm::vec2(ndc) * 0.5f; // Don't add the 0.5, since we want the center to be 0
note that it's perfectly fine for the resulting coord to not actually be in [-0.5, 0.5]. that just means it's not within the frustum of the stable viewproj (which is highly likely for the smaller clipmaps)
to get the all-important offset, we just multiply that "uv" by the number of pages in the clipmap
const auto pageOffset = glm::ivec2(uv * glm::vec2(context_.pageTables_.Extent().width, context_.pageTables_.Extent().height));
(btw it might be better to round than to truncate, idk)
anywho, this offset tells us how many page widths we need to translate the clipmap camera to be approximately centered on the player
I think if you trunc you will get bad behavior when you switch from positive to negative no?
since -1 -> 0 <- 1
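To make the trunc concern concrete: a float to int conversion (which is what the `glm::ivec2` constructor does) rounds toward zero, so page 0 ends up twice as wide as every other page. Quick comparison on one axis, with standalone helper names and a hypothetical page count.

```cpp
#include <cassert>
#include <cmath>

// Truncation toward zero, same behaviour as converting a float to an int.
inline int page_offset_trunc(float uv, int page_count) {
    assert(page_count > 0);
    return static_cast<int>(uv * static_cast<float>(page_count));
}

// Floor: every page covers an equally sized interval, also across zero.
inline int page_offset_floor(float uv, int page_count) {
    assert(page_count > 0);
    return static_cast<int>(std::floor(uv * static_cast<float>(page_count)));
}

// Round to nearest: keeps the clipmap centre within half a page of the player.
inline int page_offset_round(float uv, int page_count) {
    assert(page_count > 0);
    return static_cast<int>(std::lround(uv * static_cast<float>(page_count)));
}
```

With 128 pages, uv = 0.005 and uv = -0.005 both truncate to page 0, while floor gives 0 and -1 and round gives 1 and -1.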
at worst you'll be off by one page which probably isn't noticeable ever
what I mean is that the camera won't be perfectly centered on the player
it's not like the shadow will be wrong
that's why it's probably impossible to notice unless you are somehow looking at every page in the clipmap
anyways
back to the explanation
the projection matrix allows us to conveniently apply a shift in NDC space by translating it
const auto ndcShift = 2.0f * glm::vec2((float)pageOffset.x / context_.pageTables_.Extent().width, (float)pageOffset.y / context_.pageTables_.Extent().height);
const auto shiftedProjection = glm::translate(glm::mat4(1), glm::vec3(-ndcShift, 0)) * stableProjections[i];
we are only interested in translating it on XY because we don't want depth to get fucked when the player moves
so we're basically sliding the bad boys on a plane
btw I originally tried right-multiplying the projection by the translation (putting the projection inside the glm::translate call), but that broke somehow 
pageOffset / numberOfPages generates a UV-space value
but projections work in NDC (actually clip space but yolo) I guess
so ye u are correct
if it worky it worky
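Worked example of where the factor of 2 comes from: UV space is 1 unit wide and NDC is 2 units wide, so a shift of `pageOffset / numberOfPages` in UV is twice that in NDC. One axis only, and the 128 page count in the note is just an example value.

```cpp
#include <cassert>

// NDC-space shift for a whole-page offset on one axis.
inline float ndc_shift(int page_offset, int page_count) {
    assert(page_count > 0);
    const float uv_shift = static_cast<float>(page_offset) / static_cast<float>(page_count);
    return 2.0f * uv_shift; // UV spans [0, 1], NDC spans [-1, 1]
}
```

With 128 pages across, an offset of 32 pages is a quarter of the clipmap in UV and 0.5 in NDC.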
here is the final line in the loop
viewMatrices[i] = glm::inverse(stableProjections[i]) * shiftedProjection * stableViewMatrix;
ok I think that line is stupid
I mean it works
Yeah I dunno what the hell is going on here
I'm basically extracting the shift from the shiftedProjection (undoing all the projection parts) and then applying it to the view matrix
so it's just translating the view matrix 
so this line is stupid because the view matrix already has this property
except it's view space instead of NDC, which is trivial to convert to
the last three lines could probably be replaced by viewMatrices[i] = glm::translate(stableViewMatrix, glm::vec3(pageOffset * frustumSize, 0));
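Sanity check of that claim on a single axis, assuming a symmetric orthographic projection (so projecting is just a scale by 2 / width). It shows that `inverse(P) * T(-ndcShift) * P` is the same as sliding the view-space coordinate by `ndcShift * width / 2`, i.e. `pageOffset` times the world size of one page. It does not verify the sign or multiplication order of the `glm::translate` one-liner, and the function names are made up.

```cpp
#include <cassert>

// Applies inverse(P) * T(-ndc_shift) * P to a view-space coordinate, where
// P is a symmetric ortho projection on one axis: ndc = view * (2 / width).
inline float shifted_view_coord(float view, float ndc_shift, float frustum_width) {
    assert(frustum_width > 0.0f);
    const float scale = 2.0f / frustum_width;
    const float ndc = view * scale - ndc_shift; // project, then shift in NDC
    return ndc / scale;                         // unproject
}

// The equivalent direct translation in view space.
inline float view_shift(float ndc_shift, float frustum_width) {
    return -ndc_shift * frustum_width * 0.5f;
}
```

For a 64 unit wide frustum and an NDC shift of 0.5, both paths move a point at view x = 3 to x = -13, a 16 unit slide.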


