#Iris - A Journey through OpenGL and beyond to learn Graphics
that doesn't work like that in all companies 💀
they decorate everything with perprimitiveext BUT the builtin variables
if slang was intel and intel had a driver bug they'd probably add a workaround to the compiler
so gl_PrimitiveID and gl_CullPrimitiveEXT are not marked
ahaaa
what is primitive id
gl_MeshPrimitivesEXT[id].gl_PrimitiveID
I use this to output primitive id to visbuffer
it's probably the same thing, one is per vertex and one is per primitive
idk
gl_CullPrimitiveEXT tho 
yea that is a problem
from reading opengl spec it seems you are not supposed to write that
its just an input for some shader stages
perprimitive?
oh you mean gl_PrimitiveID
yee
I mean
```glsl
perprimitiveEXT out gl_MeshPerPrimitiveEXT {
    int gl_PrimitiveID;
    int gl_Layer;
    int gl_ViewportIndex;
    bool gl_CullPrimitiveEXT;
    int gl_PrimitiveShadingRateEXT;
} gl_MeshPrimitivesEXT[];
```
vk spec says this is writeonly
tbf I also don't really get the point of mesh shader gl_PrimitiveID beside some weird compat with frag shaders that were consuming/expecting gl_PrimitiveID
yea
btw I found a use case for secondary command buffers
you can defer decisions like loadOp, storeOp and image layouts until later point in time
but cant you do that with normal cmd buffers too
tbh my abstraction is just kinda weird
but secondary cmdbufs turned out to be rather handy for it
I never understood the secondary command buffers
I thought they were what work graphs are meant to be

do you mean indirect commands
secondary command buffers are like normal command buffers except they need less information before you can record a draw
Ah okay I was heavily misinformed then
I thought secondary command buffers could be dispatched from primary ones
Hmmmm
the microsoft employee side of you is leaking
me on my way to rename descriptor_buffer to descriptor_list
ok good
tbh one thing I'd like if memory management for cmdbufs was more explicit
CommandPool feels ugh
can I just please provide a callback that driver will call every now and then 
real
btw
VK_EXT_nested_command_buffer is a thing apparently
so you can recurse with command buffers very deeply
idk who asked for that, but
lool
John Daxa probably
Common misconception
He actually is called "Join Daxa"
Jane Daxa
my bad i meant Jérôme Daxa
https://github.com/KhronosGroup/SPIRV-Tools/issues/4919 turns out, there was already an issue
it's been sitting there for 2 years
thank you John Khronos 
in the default implementation yeah but I have a custom one which dynamically creates a set with all required images
hmm no that works
all my textures are SRGB generally
did you see this already lvstri
https://themaister.net/blog/2024/01/17/modernizing-granites-mesh-rendering/
Well, you have a custom backend, so you probably do some sRGB trickery
But also try drawing some ColorEdit and see if displayed color matches there
Also you can open ImGui's style editor in DemoWindow and color pick the colors to see if they match
Simple sRGB-cope implementations usually have subtly wrong colors
(especially FrameBg, FrameBgHovered etc.)
I just convert the vertex colors to linear
@wicked notch have you started software raster? Do you know why nanite forces HW raster for triangles that need to be clipped to the screen?
clipping is faster in HW
Isn't it just a min/max of each screen space vertex to clamp to the screen bounds?
no
if it was just a min/max clamp you'd change the shape of the triangle
I don't even think this gives you clipped triangle
yeah exactly
Ah I see
Oh wait I misunderstood what this tutorial I'm reading is doing
They were clipping the aabb for the triangle, not the triangle vertices
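A rough C++ sketch of the distinction above: only the triangle's screen-space bounding box is clamped to the viewport, while the vertices stay untouched, so the triangle keeps its shape. All names here (`Vec2`, `PixelRect`, `ClampInt`, `ClampedTriangleBounds`) are made up for illustration, and the sketch assumes the triangle overlaps the viewport at least partially.

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };
struct PixelRect { int minX, minY, maxX, maxY; };  // inclusive pixel bounds

// Hypothetical helper: clamps v into [lo, hi].
inline int ClampInt(int v, int lo, int hi) { return std::max(lo, std::min(v, hi)); }

// Bounding box of a screen-space triangle, clamped to the viewport.
// The vertices themselves are not modified, so any edge functions
// evaluated later still describe the real (unclipped) triangle.
inline PixelRect ClampedTriangleBounds(Vec2 a, Vec2 b, Vec2 c, int width, int height) {
    const float minX = std::min(a.x, std::min(b.x, c.x));
    const float minY = std::min(a.y, std::min(b.y, c.y));
    const float maxX = std::max(a.x, std::max(b.x, c.x));
    const float maxY = std::max(a.y, std::max(b.y, c.y));
    PixelRect r;
    r.minX = ClampInt(static_cast<int>(std::floor(minX)), 0, width - 1);
    r.minY = ClampInt(static_cast<int>(std::floor(minY)), 0, height - 1);
    r.maxX = ClampInt(static_cast<int>(std::floor(maxX)), 0, width - 1);
    r.maxY = ClampInt(static_cast<int>(std::floor(maxY)), 0, height - 1);
    return r;
}
```

A software rasterizer would then loop over the pixels of the returned rectangle and test each one against the original triangle edges.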
https://cent.felk.cvut.cz/courses/PGR2/lectures/10-architekturyII.pdf
Look at this if you're interested
Starting Slide 46
Thanks!
why are paper filenames always ass 😦
it is a lecture slides thingy
hmm I don't see the issue here. It is a lecture 10 - which is about GPU architectures (architektury in Czech) part II
I thought you liked funny spellings
Does anyone understand what nanite talks about when it comes to solving for the x interval for software raster?
I suppose I should read the PDF they link as reference [71] 😛
Nope I read through cure's GitHub project and can't actually find what they're doing
@wicked notch how do you decide between SW/HW raster per cluster? Nanite presentation says all 3 triangle edges < 32 pixels long. But uhh, how do you determine that at the cluster level? Compute some kind of average triangle edge length per cluster?
And then what, SW rasterize anything with a small viewport area?
yes
Makes sense, thanks
if the aabb is less than 32 pixels in each direction
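As a sketch of that per-cluster test in C++, assuming the cluster's bounds have already been projected to a screen-space AABB in pixels. `ScreenAabb`, `UseSoftwareRaster` and the default threshold parameter are illustrative names, not taken from Nanite or any engine mentioned here.

```cpp
struct ScreenAabb { float minX, minY, maxX, maxY; };  // projected cluster bounds, in pixels

// Returns true when the cluster is small enough on screen for the software
// rasterizer: its AABB must be under the threshold in both directions.
inline bool UseSoftwareRaster(const ScreenAabb& b, float maxExtentPixels = 32.0f) {
    const float extentX = b.maxX - b.minX;
    const float extentY = b.maxY - b.minY;
    return extentX < maxExtentPixels && extentY < maxExtentPixels;
}
```

Clusters that fail the test would go to the hardware raster list instead.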
I'm almost ready to do software raster, I just need some final wgpu changes to do image atomics. Going to use buffers in the meantime.
I also need to figure out how to rework my culling code to feed into the SW/HW raster. Probably switch to two atomic lists of cluster IDs.
And then indirect dispatched on the SW raster / hw index buffer write.
yes
thats how we will get LVSTRINV soon
LPSTRI
luigi? i think you should try to get into novideo
after school
just like pixel went to valve
having a frog at important spots in the industry helps to fight John Khronos
you're asking me to go to the dark side
explicitly
why is nv dark side
true that
my megacorp is better than your megacorp
@wicked notch I move here to not pollute Vulkan
I don't get it
do you do one dispatch per task shader workgroup?
or per subgroup?
(my actual knowledge of mesh shaders is limited)
workgroup is 128
subgroup is 32
when I call vkCmdDrawMeshTasksEXT(1, 1, 1) I dispatch one task shader to cull 128 meshlets
if all of those survive I will have 4 mesh shader commands (EmitMeshTasksEXT()) each with a workgroup dispatch size of 32
what I'm trying to do is reduce the number of EmitMeshTasksEXT calls so that NV command processor doesn't commit sudoku
okay so you still cull 32 meshlets per task subgroup?
uh
I'll just tell you what we do in Tido
each task shader invoc culls 32 meshlets, we get base offset, subgroup offset (I think) and survivor bitmask
then we dispatch mesh shader (32 threads) for each surviving meshlet
yeah survivor bitmask sounds like a good idea
what's the survivor bitmask for
to know what meshlet mesh shader works on
you already know that don't you
replaces your deltaIDs from your payload
nth mesh dispatch looks for nth set bit in the bitmask
so you do an exclusive workgroup bitcount?
if there even is an intrinsic for that
maybe it's only subgroup bitcount
we cooked with Patrick
wait
func wave32_find_nth_set_bit(uint mask, uint bit) -> uint
{
    // Each thread tests a bit in the mask.
    // The nth bit is the nth thread.
    let wave_lane_bit_mask = 1u << WaveGetLaneIndex();
    let is_nth_bit_set = ((mask & wave_lane_bit_mask) != 0) ? 1u : 0u;
    // Inclusive prefix sum: number of set bits up to and including this lane.
    let set_bits_prefix_sum = WavePrefixSum(is_nth_bit_set) + is_nth_bit_set;
    let does_nth_bit_match_group = set_bits_prefix_sum == (bit + 1);
    // Named `ballot` so it does not shadow the `mask` parameter.
    uint4 ballot = WaveActiveBallot(does_nth_bit_match_group);
    // The lowest matching lane is the nth set bit; 100 marks "no match".
    uint first_set_bit = WaveActiveMin((ballot.x & wave_lane_bit_mask) != 0 ? WaveGetLaneIndex() : 100);
    return first_set_bit;
}
returns the position of nth set bit in a bitmask, uniform for entire subgroup
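For reference, a scalar CPU version of the same query in C++ could look like this. It is only a sketch to check results against; `FindNthSetBit` is a hypothetical name, and returning 32 for "not found" is an assumption of this sketch (the wave version above uses 100).

```cpp
#include <cstdint>

// Returns the position of the n-th set bit in mask (n is zero-based),
// or 32 if the mask has fewer than n + 1 set bits.
inline uint32_t FindNthSetBit(uint32_t mask, uint32_t n) {
    for (uint32_t bit = 0; bit < 32; ++bit) {
        if ((mask & (1u << bit)) != 0) {
            if (n == 0) {
                return bit;
            }
            --n;  // skip this set bit and keep looking
        }
    }
    return 32;
}
```

With a survivor mask of `0x16` (bits 1, 2 and 4 set), mesh dispatch 0 maps to meshlet 1, dispatch 1 to meshlet 2 and dispatch 2 to meshlet 4.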
why would you do this instead of just using subgroupBallotExclusiveBitCount 
no spirv intrinsics?
but even if you did exclusive bit count
how do you know what thread to broadcast from
like the thread itself knows sure
but how do the others know
interesting question
intuitively, each subgroup invocation calculates its own exclusive bit count
so instead of a ballot exclusive bit count, you do a regular bit count on the mask
well you want a subgroup per meshlet
so all 32 threads should agree on the meshlet
no?
or do you have a different mapping
the exclusive/inclusive bit count would just replace the WavePrefixSum and two lines above
so it's a workgroup bit count
the ballot and min you still need
why do you use such large workgroups
nv recommends 32 still
i also always lost perf on larger groups
the survivor list reduced the lsb pressure quite a bit that was very nice
larger groups are probably better when you have larger payload
i see now why you did it
the idea was to reduce the number of invocations needed to cull and draw
mesh workgroups are fake on nv, it's just a single hw subgroup cycling through api subgroups, compiler munches your shader
wait wat
can I at least expect gl_SubgroupID to work as I would expect it to work?
when workgroup size is multiple of subgroup size
yes
what bird whispered you this forbidden knowledge
yes, that's just an impl detail
nvk people
nice
well its kinda obvious given the NV extension is always one dimensional
slang your beloved (NOT MINE THOUGH)
they really are sweet hearts
they just implement it
i love them
cope cope cope
^ you when a slang release drops (at least this one)
@glass sphinx do you have docs on spirv intrinsics for slang
trying to review a slang patch and it makes my brain boil slightly
afaik there are only tests
I wonder how nano copes with email
I just re-read my thingy many times before sending
Same
but also by not sending any emails often
I hate email
yeah it sucks
agree but I like receiving notifications on my issues/MRs/whatever in a centralized box
very handy
Sure but I mean for direct communication
yeah for direct communication it kinda sucks
Especially when mfs splinter threads and then the inbox becomes impossible to navigate
(might be a client/skill issue idk)
yes, skill issue
plus all the signature junk polluting every thread
what client is this
thunderbird
never used outlook
I use it for work unfortunately
I did use Thunderbird once. Perhaps I should use it again
considering outlook is supposed to be backed by a big chungus corporation it probably has comparable functionality
so mayhaps just do a bit of googling
do you use folders
or do you just look at your inbox
create folder + setup a filter by something
Damn email powerusers
It's necessary when you get dozens of emails every day 

I contribute to one repo and they have the GitHub actions setup to try and build after each commit for like 15 different setups. So after each commit you receive 15 emails of the build succeeding or failing
hmm I only get emails when CI fails in my thing (Fwog)
Or maybe it's that, but their gh actions are broken so everything fails 🥸
I don't really read them, because it was just garbage every time I did
900k+ lines of compiler errors
Yeah, takes me hours to write anything, I hate it too
i always send cryptic shit
if end to end encryption gets outlawed at some point im safe
i trained all the ppl i know to decipher my shit
noone else will understand
My depth down sampling is missing some pixels on npot textures
And therefore the way I project bounding spheres to screen space is wrong. Ahhh.
https://github.com/LVSTRI/Retina/blob/508dfb0ba11e6ffaba7633c07d2f5613e06472b9/src/Retina/Sandbox/Shaders/GBufferResolve.frag.glsl#L145 @wicked notch mister what is this cope? 
it's what one does to avoid pagefaulting 
though tbh I don't think adding a page request is too bad
hmm with PCF this is easy, you just request the PCF region to be in the same clip, but I have no clue about how to do it with pcss
same thing except it's max(blockerSearchRadius, maxFilterRadius)
right, but those can be massive no?
I don't follow
when filtering, if a page is not allocated you just emplace an allocation request in the list
to be drawn next frame?
ye
When working with mesh shaders, the geometry needs to be split into meshlets: small geometry chunks where each meshlet has a set of vertices and triangle indices that refer to the vertices inside each meshlet. Mesh shader then has to transform all vertices and emit all transformed vertices and triangles through the shader API to the rasterizer. ...
@primal shadow you're building your own version of SPD right?
can you pin me the discussions you had with devsh, I can't find them
No I'm not working on anything ATM. I'm on vacation and don't have my desktop, and my laptop is broken.
rip
#graphics-techniques message
Depth down sampling for the initial pass is hard though. I messed up and need to fix it.
halving time for hzb copy + reduce, 'ery nice
150us -> 80us
now I can spam this for all 16 clipmaps
that will be 1.3 ms
yes 
because amd cannot comprehend the existence of operating systems other than windows
this is kinda sorta single pass
except it's 3 pass 
oooff
wait but you can just copy the tido single pass downsampler
it just doesnt properly work on the bottom and right edge if you have a non pot res
yea
nice
how long does it take to downshrimple this image?
assume it's just single channel float
i was thinking of rescaling the lowest mip to some pot size
~20 microseconds
on 4080
fully memory bound
hmm
for me it takes 30 microseconds to copy the image and 30 microseconds to fully downsample it
does that include the initial reduction
no
ok so your thing is 2.9x faster
arguably i should remove the writeout for the lowest 2-3 mips not just mip0
culling on that fine scale is unnecessary
also slower even potentially
i need to test it again
on a nuclear GPU
btw saky
hear me out
when drawing the vsms
I bind temporary virtual resolution depth texture to get the sweet early Z
@glass sphinx do you have code to share? I realized my down sampling does not work for npot textures :(.
now let's say I build HZB out of this temporary depth texture
in the initialization step, I query for the active page table, and write 0 if page is inactive, or the depth if it is active
then I reduce the thing and cull
culling outputs a bitmask of visible meshlets
this repeats for each clipmap
downside is subchannel switch
uhh
Ty, I will steal and port to wgsl 🙂
so you cull your lower clips with the depth written out by the higher clips?
I don't fully follow
also it makes it so that you have to barrier between each clip draw
16 barriers each frame
yucky
no I don't need barriers
or yes I do
but only for the depth texture
transitioning from undefined because I discard the results
I switched to 2 bits per meshlet btw. 0 = not eligible (wrong lod, instance culled, failed frustum cull), 1 = first pass draw (passed last frame occlusion cull), 2 = second pass candidate (failed last frame occlusion cull), 3 = second pass draw (passed current frame occlusion cull)
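A small C++ sketch of that 2-bit encoding, packing 16 meshlets per 32-bit word. The enum names mirror the four states listed above; the packing layout and the function names are assumptions for illustration, not the actual implementation.

```cpp
#include <cstdint>
#include <vector>

enum class MeshletState : uint32_t {
    NotEligible = 0,          // wrong LOD, instance culled, failed frustum cull
    FirstPassDraw = 1,        // passed last frame occlusion cull
    SecondPassCandidate = 2,  // failed last frame occlusion cull
    SecondPassDraw = 3        // passed current frame occlusion cull
};

// Writes the 2-bit state of one meshlet (16 meshlets per 32-bit word).
inline void SetMeshletState(std::vector<uint32_t>& words, uint32_t meshlet, MeshletState state) {
    const uint32_t word = meshlet >> 4;
    const uint32_t shift = (meshlet & 15u) * 2u;
    words[word] = (words[word] & ~(3u << shift)) | (static_cast<uint32_t>(state) << shift);
}

// Reads the 2-bit state of one meshlet.
inline MeshletState GetMeshletState(const std::vector<uint32_t>& words, uint32_t meshlet) {
    const uint32_t shift = (meshlet & 15u) * 2u;
    return static_cast<MeshletState>((words[meshlet >> 4] >> shift) & 3u);
}
```

On the GPU the read-modify-write in `SetMeshletState` would need atomics whenever several threads touch the same word.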
well you need a full clear no?
yes
yeah
but it's accelerated so it doesn't count
they don't but I'm hoping for hiz culling to make up for the difference
Also I asked this question earlier and didn't get a response, do we really need to do an explicit frustum cull for each meshlet? We have to occlusion cull anyways, which involves projecting the culling sphere to a screen space aabb. We can just check if it's valid, no?
mine also doesnt. I still need to add a variant that respects borders
I don't understand how you'd use the temp z for hiz cull
like you draw the biggest clip, and cull the lower one against hiz from the biggest clip?
ie draw clip 8 - build hiz of 8 - cull clip 7 against 8 - draw clip 7 - build hiz 7 ....
There's https://miketuritzin.com/post/hierarchical-depth-buffers, but idk how optimized it is.
```
for clipmap in clipmaps {
    Clear(tempDepthBuffer);
    Draw(tempDepthBuffer, previousFrameVisibilityMask[clipmap.Index]);
    // Also performs an initial reduction and checks whether the current depth tile
    // is active in the page table; if it isn't active, the depth written is 0.
    Copy(tempDepthBuffer, clipmap.HZB);
    Reduce(clipmap.HZB);
    Cull(clipmap.HZB, currentFrameVisibilityMask[clipmap.Index]);
    Barrier();
}
```
right, but I don't see how this is different than doing it without temp hiz?
you can just as well draw previousFrameVisibilityMask into the VSM itself
reduce the VSM
and cull against that
you don't really need the temp z for this no?
or am I missing something again 
now that I think about it you're right I don't
I can just make a dispatch over the full virtual resolution and each thread checks for the virtual page table, gets physical texel and write that (or 0 if page isn't active)
this way draws overlap too
and we can do that inplace in the physical memory itself, (we just make the physical memory have mips, that way you can reduce over the physical memory)
ye but I don't understand how to cull against the actual VSM with mips 
I mean there will be some index math but I don't see how it would not be doable
conceptually it is all the same no?
It will be a bit more involved yeah
thinking about it, a quad could span 4 different physical pages
yes
(one per corner)
so you need to go from NDC -> VSM_PAGE -> PHYSICAL_STORAGE
or, over some mip cutoff just NDC -> VSM_HIZ
it's just an indirection
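To make the indirection concrete, here is a CPU-side C++ sketch of going from a virtual texel to a physical one through a page table. The page table encoding (top bit marks a backed page, low bits hold a physical page index), the row-major page grid, and every name in it are assumptions for illustration; the real layout in Retina or Tido may differ.

```cpp
#include <cstdint>
#include <vector>

struct Texel { uint32_t x, y; };

struct VirtualShadowMap {
    uint32_t pageSize;                // texels per page side, power of two
    uint32_t virtualPagesPerRow;      // width of the virtual page table
    uint32_t physicalPagesPerRow;     // width of the physical page grid
    std::vector<uint32_t> pageTable;  // one entry per virtual page
};

constexpr uint32_t kPageBackedBit = 0x80000000u;  // assumed flag bit

// Translates a virtual texel to its physical texel.
// Returns false (and leaves out untouched) when the page is not backed.
inline bool VirtualToPhysical(const VirtualShadowMap& vsm, Texel v, Texel& out) {
    const uint32_t pageX = v.x / vsm.pageSize;
    const uint32_t pageY = v.y / vsm.pageSize;
    const uint32_t entry = vsm.pageTable[pageY * vsm.virtualPagesPerRow + pageX];
    if ((entry & kPageBackedBit) == 0) {
        return false;
    }
    const uint32_t physicalPage = entry & ~kPageBackedBit;
    const uint32_t cornerX = (physicalPage % vsm.physicalPagesPerRow) * vsm.pageSize;
    const uint32_t cornerY = (physicalPage / vsm.physicalPagesPerRow) * vsm.pageSize;
    // The offset inside the page is the same in virtual and physical space.
    out.x = cornerX + (v.x & (vsm.pageSize - 1u));
    out.y = cornerY + (v.y & (vsm.pageSize - 1u));
    return true;
}
```

A quad that spans four pages would run this once per corner, which is where the extra index math comes from.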
(I'm saying it super nonchalantly, as like I won't run into 50 bugs trying to implement this)
btw I found the button to enable this view in outlook (show as conversations) 
look what came in the mail today https://jglrxavpok.github.io/2024/04/02/recreating-nanite-runtime-lod-selection.html
please make a blog, I don't wanna read twitter clones
posting short messages can be easier for the author to write
jokes on you i procrastinate doing either
so it's a choice between lustri not writing anything (and you thus not reading) vs reading something that's perhaps not very quality
one weird trick
bro
leave it up to lustri to decide
true, I should probably give in and make a memestodon account so I can continue passively trawling for GP info
just pick an instance that doesn't tolerate and/or enable morons
or other instances will block the instance you've picked which might reduce the quality of your experience
@dull oyster
fully but why do you do this:
```cpp
vk::DeviceAddress vertexBufferAddress = (vk::DeviceAddress)-1;
vk::DeviceAddress indexBufferAddress = (vk::DeviceAddress)-1;
```
why not just 0
0 is null & invalid
not Twitter 
I use my own blog on GitHub pages. Took like an hour to setup.
Honestly? Don't really know why I did that
I really should sleep because my eyes are failing me but the vsm HiZ experiment is almost done
allez dodo! 🦤
```glsl
const uint cornerX = index & log2(position);
const uint cornerY = index >> log2(position);
```
mmm yes

are you doing two-pass hiz
just single pass rn
because I think you either have to do that or make pages active for two frames so geometry can render properly
that does make sense
I tried one-pass hi-z and didn't think of that issue, then it was broken and I deleted it all 
ye I'm willing to accept some artifacts for now
I'll do two pass after this works properly and I get the perf boost I'm expecting
buckle up because I'm about to create a 2048x2048x16x4 image
Does it at least go brr when drawing?
I will test soon
meanwhile
```glsl
float SampleVirtualShadow(in uvec2 position) {
    const uint power = findMSB(VIRTUAL_SHADOW_PAGE_SIZE);
    const uvec2 virtualPagePosition = position >> power;
    const uint virtualPage = imageLoad(g_VirtualShadowVirtualPageTableImage, ivec3(virtualPagePosition, gl_WorkGroupID.z)).x;
    if (!VirtualShadowIsPageBacked(virtualPage)) {
        return 0;
    }
    const ivec2 physicalTexelCorner = VirtualShadowCalculatePhysicalTexelCorner(virtualPage);
    const ivec2 physicalTexel = physicalTexelCorner + ivec2(position & uvec2(VIRTUAL_SHADOW_PHYSICAL_PAGE_SIZE - 1));
    return uintBitsToFloat(imageLoad(g_VirtualShadowPhysicalMemoryImage, physicalTexel).x);
}

RetinaGroupSize(16, 16, 1)
void main() {
    const uvec2 position = gl_GlobalInvocationID.xy;
    const uvec2 virtualPosition = position << 1;
    const vec4 samples = vec4(
        SampleVirtualShadow(virtualPosition + uvec2(0, 0)),
        SampleVirtualShadow(virtualPosition + uvec2(0, 1)),
        SampleVirtualShadow(virtualPosition + uvec2(1, 0)),
        SampleVirtualShadow(virtualPosition + uvec2(1, 1))
    );
    const float v = min(min(samples.x, samples.y), min(samples.z, samples.w));
    imageStore(g_OutputImage, ivec3(position, gl_WorkGroupID.z), vec4(v));
}
```
can I make this better in any way
I stall on TEXTHR though so I doubt that 
On phone
Will check when I'm back on pc
2ms shadowmap draw seems fine no?
(from the capture )
that's without hzb, I will make culling go brr now
This is bistro 16 clips?
nono sponza
because it loads faster so I can test without waiting 
mfw no async load
Would be pog if it was bistro
Ye it's pain to do
But fun when it works
Launch times are instant
Hmm, how much faster is Sponza compared to bistro?
a lot
but it's not comparable to regular raster either way
I think I can make the copy go faster if I abuse the fact that page size is 128
I just allocate a ton of shared memory and put the samples there
Okok, was just wonderin
I reduce memory traffic
memory subsystem is happy
I am happy

copy takes literally 3x less time
hell yeah
At the mere cost of 380 megabytes of VRAM we get hiz
less than what bistro would take in an acceleration structure 
plus this is the dumb shitty naive method
if we reduce the VSM itself we get infinite power
The if is the important part 
I believe in you
ight moment of truth
Speed
I'm rendering at the blistering pace of VK_ERROR_DEVICE_LOST milliseconds per frame
I'm curious btw, are validation layers useful to you?
somewhat
Recently I can't really do anything other than shader print with them
Yeah, GPU based just crashes or misdiagnoses so the app doesn't run properly
Sync doesn't get bindless
And everything else is done by daxa
holy mother of all pog
it is working
ladies and gentlemen
800 microseconds raster on bistro
it takes longer to build the HZB 
If we combine it with hpb cull and caching it will be light speed
this is already hpb
Do you know if it actually works?
I write far depth into non backed pages
Like do you have Shadows?
Aha I see
Crossing my fingers
nop
Hmm weird
do you query memory budget every frame for those statistics
Btw is it two pass?
one pass
So you just keep depth from previous frame and cull against that or?
no I use previous frame visibility mask
I do that because it's easier to switch to two pass this way
previous frame visibility mask to cull current frame meshlets
it's a bogus approach
but it makes it easier to do two pass
Like I only have to do another culling pass and xor the visibility masks together
Hmm okay what is visibility mask? I thought it's just a bit mask of visible/notvisible for each cascade?
something to mark meshlets as visible/not visible
each bit encodes visibility of a single meshlet
```glsl
bool IsMeshletVisible(uint meshletInstanceIndex) {
    const uint maskIndex = meshletInstanceIndex >> 6u;
    const uint bitIndex = meshletInstanceIndex & 0x3fu;
    const uint64_t mask = RetinaDereference(g_VisibleMeshletBuffer)[maskIndex];
    return (mask & (uint64_t(1) << bitIndex)) != 0;
}
```
this is in the task shaderino
3x fu 😛
Tido has super sampling 
How do you set this mask?
when I cull
void main() {
const uint meshletInstanceIndex = gl_GlobalInvocationID.x;
if (meshletInstanceIndex >= u_MeshletCount) {
return;
}
...
const uint meshletVisibilityMaskIndex = meshletInstanceIndex >> 6u;
const uint meshletVisibilityMaskBit = meshletInstanceIndex & 0x3fu;
if (IsMeshletVisible(aabb, viewInfo, transform)) {
atomicOr(RetinaDereference(g_VisibleMeshletBuffer)[meshletVisibilityMaskIndex], uint64_t(1) << meshletVisibilityMaskBit);
} else {
atomicAnd(RetinaDereference(g_VisibleMeshletBuffer)[meshletVisibilityMaskIndex], ~(uint64_t(1) << meshletVisibilityMaskBit));
}
}
let me take a moment to curse at shaderc
```
Failed to compile shader 'D:/Dev/CLion/Retina/src/Retina/Sandbox/Shaders/GBufferResolve.frag.glsl': shaderc: internal error: compilation succeeded but failed to optimize: ID '128[%g_VirtualShadowInfoBuffer]' defined in block '9[%9]' does not dominate its use in block '258[%258]'
  %258 = OpLabel
```
fuck you
i understand 0, but its utterly fascinating, and im not trying to be sarcastic or anything. its 
maybe we lure the slang peeps onto our server and chain them somewhere in the basement, so that they can immediately fix those things
So if a meshlet stops being visible, how can it become visible again?
shadows do work with culling
and they're blazing fast to raster
VSM has been conquered
the mask is cleared every frame
if meshlet passes both frustum and hiz test it is set as visible
POG
okay so you draw meshlets visible last frame, build hiz, draw everything again and cull against hiz?
correct, minus the "draw everything again"
draw meshlets visible last frame, build hiz and...??
cull for next frame
when do you draw what was not visible last frame
don't try too hard to understand this method, it's garbage 
in theory after you build the hiz yes
I'm just not doing that rn for shrimplicity
damn, same scene for me but just using four csm cascades and I have 40fps instead of >200 :(
but that means that you can miss stuff that was not visible last frame and just became visible this frame no?
ye that's when you xor the masks
for a frame I guess, because it is visible this frame so you draw it next frame
two pass occlusion culling works like this
```
Draw(visibleLastFrame);
BuildHZB();
CullAndXor();
Draw(disoccludedCurrentFrame);
```
I do this right now
```
Draw(visibleLastFrame);
BuildHZB();
Cull(); // for next frame
```
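The mask bookkeeping for the second pass can be sketched on the CPU in C++. One caveat: the chat calls this step an xor of the two masks, while the sketch uses and-not, which is one interpretation. A plain xor would also flag meshlets that were drawn in the first pass and are occluded now, and those do not need a second draw. `Mask` and `Disoccluded` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Mask = std::vector<uint64_t>;  // one bit per meshlet, 64 meshlets per word

// Meshlets that pass the cull against the fresh HZB this frame but were
// not drawn in the first pass. These are what the second pass has to draw.
// Both masks are assumed to have the same number of words.
inline Mask Disoccluded(const Mask& drawnFirstPass, const Mask& visibleNow) {
    Mask out(visibleNow.size());
    for (std::size_t i = 0; i < visibleNow.size(); ++i) {
        out[i] = visibleNow[i] & ~drawnFirstPass[i];
    }
    return out;
}
```

After the second draw, `visibleNow` would become next frame's first-pass mask. Skipping the second draw, as in the shorter variant above, is what introduces the one frame of delay.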
right so you do have one frame of delay
yeah this is awesome
1ms
what gpu btw?
3070
now potrick must figure out a way to make hzb build go brrr
because currently that takes more than VSM raster 
I keep running out of resolution on the 0th clip on a 2k monitor 
with my heuristic?
I need to turn my 0th clip world size to 8 meters to work
nono yours its impossible
look at little sneaky guy
eeeeviiiiill
I wonder if better culling makes it faster to not cache at all
time to get back into the saddle
@wicked notch we can prob just not write the lowest mips for hiz
maybe even only on page level
if we do it at page level it will be like 10 mics for all clips
idk how fucked the culling will be with that tho
ye I think the real strat is just doing reduction on the VSM itself
per page, that is
I think it can work, because it's effectively the same thing, just the sampling gets more convoluted
solution: make an issue
issues fix everything
is that you maisonbleu?
I'm not a blue house?
: )
oh, it wasn't french
i dont remember your nickname but you fly the team-effort-d3d11 role 🙂
or you are the 8bit guy
I'm 8 bit, yea
ah
OH you changed your name
you changed your whole account
nibble is 4 bits
half the intelligence
fair
i had no idea you changed user, thought you disappeared
ill never disappear
im not a gpu dev though

I ponder
if the first clipmap is smol
frametime goes up a lot in that goddamned spot
near the bushes that is
it's fine literally everywhere else
I need to inspecc more
renderdoc really hates my 2k^2 x16 layers x12 mips image though 
so here is a partially reduced version of the first clipmap in that area
could that be overdraw in the bushes killing the perf?
it should be mitigated by hiz
maybe debug draw the bushes like potti did few months ago
frozen frustum and fly around the bush
to see if it do hiz or hiznt
oh you sparked an idea in me
page heatmap
brb
renderdoc is sometimes bogus though smh
ahh cool idea
Why do we come up with the coolest of ideas when the article's deadline stares us down???
some people work better when there is pressure
I like to blame my weak discipline
out comes ze diamonds : )
found the problem at least 
tbf that's where most of the detail actually is
is it possible that the geometry there is just simply fucked (read not ideal) and they really couldnt just be bothered
mayhaps needs some work in blender
nah I see now what's going on
ah
and yes it is the geometry being horrid
but it's something worse
big meshlets
foliage is a completely disconnected mesh and meshoptimizer already doesn't care about spatial locality
so it just grabs whatever triangles it has available to build the meshlets, this makes the AABB huge
hence hiz can't cope
that means you need a carefully crafted scene to properly generate meshlets for all the newfangledisms
..or
fix meshoptimizer
you were never obligated to help with the writing
frogeshit*
I must uphold that duty
I invite you to read what we have so far on overleaf
I shall
gud
I wrote what we do in tido
looks like LVSTRI blogs bout his college setup
Excuse me but lately I've been really busy. Please disenjoy and leave a dislike.
This video was edited too with ffmpeg, recorded with my smartphone.
its an italian keyboard layout but he tinkered with some keys
lustri i also just noticed your cmake installing deps thingy, thats neat 🙂
also also, you dont need to tell cmake about your headers, cpp is enough, when declaring a target add_executable/library
ik, do I do that schtuff somewhere
I think I do target_sources everywhere, maybe there's some stray header
root CMakeLists.txt
set(RETINA_HEADERS
and you still use cgltf, i will tell @loud crag
I dun see dis 
oh did you perchance remove it already :3
mayhaps
I don't remember ever having that but 🅱️erhaps my brain is failing me
as it usually does
oh
looks like i was on an old commit 
.Entry is still called .Entry though hehe
man your code is so readable
is it
I still need to make rg to separate the passes
rn everything is in application 
i fink it is, still not sure about the C for classes, but i very much stole the S and E prefix too now 🙂 and the auto Foo() -> ism
dlss commit is outdated it seems
[cmake] fatal: Fetched in submodule path 'NVIDIAImageScaling', but it did not contain 35e13ba316c98eeecf16f37eae70ce88019911f6. Direct fetching of that commit failed.
[cmake] CMake Error at nvdia_dlss-subbuild/nvdia_dlss-populate-prefix/tmp/nvdia_dlss-populate-gitclone.cmake:62 (message):
it should be disabled on lunix anyway i suppose, perhaps with some autodetection and a message?
ye
trying clang 17.0.6
ah lol?
dlss cloning worked now
it cant find #include <vulkan/vk_enum_string_helper.h>
wot
hopefully not a new vksdk ;c
eh hang on
the string helper has been there since 1.2 at least
this is weird
vulkan/vulkan.hpp is coming from /usr/include/vulkan wtf
```
[deccer@rootfs ~]$ sudo pacman -R vulkan-devel
checking dependencies...
error: failed to prepare transaction (could not satisfy dependencies)
:: removing spirv-tools breaks dependency 'spirv-tools' required by glslang
:: removing vulkan-headers breaks dependency 'vulkan-headers' required by qt6-base
:: removing spirv-tools breaks dependency 'spirv-tools' required by shaderc
```
;C
vulkan-headers methinks
i have no memories installing anything qt6y
kde I think is qt
issa ok, I'll fix linux building once and for all now
haha
ah
its gnuplot
and that stupid patchpanel for thingy pipewire
and obs : )
removing ffmpeg breaks dependency 'ffmpeg' required by firefox
nice
don't nuke your os 
: )
good thing is its ez to reinstall
but why would firefox have a hard dependency to ffmpeg
thats new
important question is, if i remove firefox and reinstall it, will it rember my current 250 open tabs 😛
ok i have to clean the tabs first anyway... will do that first
and i should also start working on local build containers
down to 150 tabs : >
down to 14

IDEs or analyzer tools want that and might break or fail to analyse the headers if you dont add the headers
perhap
had no trouble with clion or vscode-cpp so far at least
in my cmake/openglstarted and other cppisms for now
clion werks fine without
Finished revamping my 2pass occlusion culling to work with LODs, and be a lot cleaner/better in general!
Reusing the previous frame depth pyramid, instead of explicitly tracking cluster visibility between frames
@wicked notch you know if nvpro_pyramid works well on non-nvidia gpus? i’m thinking of using it, too, but reading comments in the source like „subgroupSize other than 32 is not tested, should work, message [email protected] if not“ is throwing me off a little
perhaps ffx spd is a better fit for me
Neither SPD or nvpro_pyramid work for non-power2 textures either, which is a huge pain. I haven't seen anyone handle it.
huh, are you sure?
the nvpro dispatcher seems like it can handle anything that is a multiple of 4 by default
that number is templated, though, so perhaps it supports other factors, too
It probably doesn't enforce power of 2 for the textures or generates
Which would prevent you from using mips of a single texture
not tracking visibility reduced perf a lot in my tests
maybe i misunderstand what you do
but if you use last frame's depth i assume you draw twice, once culled against the last frame and once culled against the partial new frame
Yes
I'm doing exactly what nanite does
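A toy sketch of the two-pass idea described above, for readers following along. Everything in it is an illustrative assumption rather than the actual Bevy or Nanite code: each object covers one "pixel", depth is standard (smaller is nearer), and a flat list stands in for the depth pyramid.

```python
# Toy model of two-pass occlusion culling (illustrative assumptions only:
# one pixel per object, standard depth where smaller = nearer, no pyramid).

def visible(obj, depth):
    """An object passes if it is not behind the depth stored at its pixel."""
    pixel, z = obj
    return z <= depth[pixel]

def two_pass_cull(objects, prev_depth):
    depth = [1.0] * len(prev_depth)  # this frame's depth, cleared to far
    drawn, deferred = [], []

    # Pass 1: test against last frame's depth and draw whatever passes.
    for obj in objects:
        if visible(obj, prev_depth):
            drawn.append(obj)
            depth[obj[0]] = min(depth[obj[0]], obj[1])
        else:
            deferred.append(obj)

    # Pass 2: re-test the rejects against the depth built this frame,
    # which recovers objects that were only hidden by stale occluders.
    for obj in deferred:
        if visible(obj, depth):
            drawn.append(obj)
            depth[obj[0]] = min(depth[obj[0]], obj[1])
    return drawn
```

With `prev_depth = [0.2, 1.0]` the object `(0, 0.5)` fails pass 1 because of last frame's occluder, then passes in pass 2 once that occluder is gone from the new depth.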
Started porting SPD to bevy. Non power of 2 is actually easy. The samples out of bounds will just return 0, which ends up ensuring conservative depth, at the cost of a poorer culling on the edges of the screen. Totally fine imo.
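A minimal sketch of why an out-of-bounds read of 0 stays conservative. It assumes reversed-Z (0.0 is far) and a min reduction, so the pyramid keeps the farthest depth; the round-up of the mip size and the helper names are also assumptions, not taken from SPD itself.

```python
# One downsample step of a depth pyramid on a non-power-of-2 mip.
# Assumptions: reversed-Z (0.0 = far), min reduction, mip size rounded up.

def load(mip, x, y):
    """Texel fetch that returns 0.0 (far) when out of bounds."""
    if 0 <= y < len(mip) and 0 <= x < len(mip[0]):
        return mip[y][x]
    return 0.0

def downsample(mip):
    h, w = len(mip), len(mip[0])
    out_h, out_w = (h + 1) // 2, (w + 1) // 2  # round up so no texel is dropped
    return [
        [
            min(
                load(mip, 2 * x, 2 * y),
                load(mip, 2 * x + 1, 2 * y),
                load(mip, 2 * x, 2 * y + 1),
                load(mip, 2 * x + 1, 2 * y + 1),
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]
```

On a 3x3 input the right and bottom output texels pick up an out-of-bounds 0.0, so they read as "far" and nothing gets culled against them. That is the poorer edge culling mentioned above.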
nvpro_pyramid does support NPOT
Almost finished with porting SPD!
dlss be working beautifully
you can use the already existing shortcut alt+f4 to reload the thing
Following commit handles init expressions of struct's + inheritance of constructors.
#3878
The general implementation follows C++ init expression rules for derived classes.
The logic is general...
c++ classes 

slang has crappy classes
well it doesn't have classes, only structs, and those aren't crappy themselves, but the semantics for this in methods is hlsl-tier garbage
because this is basically inout
:barf:
make an issue 
Ok finished SPD port to Bevy/WGSL and using it in my nanite-impl 🙂
Much better perf
ah wtf
it's inout for hlsl
since hlsl be hlsl
spirv is actually a this I think
spirv doesnt have a concept of classes at all
does spirv have storage class - generic pointers?
I don't like spir-v generic pointer tbh
or rather
I'm kinda suspicious about just throwing it into vk as is
they're good if good tooling could use them (vcc & slang one day)
although I don't know the extent to which they can be used (never checked up deeply), I only really know about restricted ptr's

it's not a thing in vk so don't get too excited (though slang probably easily could be taught to emit that and then fed into shady to lower generic ptrs to tags)
and the generic pointers issue I opened a while ago is on a really slow cook
I'm hitting the max dispatch limit for culling workgroups 😬
memory also continues to be really annoying to allocate large amounts of
how big
Caused by:
In a ComputePass
note: encoder = `<CommandBuffer-(3, 3, Vulkan)>`
In a dispatch command, indirect:false
note: compute pipeline = `meshlet_culling_first_pipeline`
Each current dispatch group size dimension ([103039, 1, 1]) must be less or equal to 65536
I can do the same dumb trick I did for the other pass and make it a 3d dispatch I then remap to 1d in order to get more workgroups 😛
Or use bigger workgroup sizes ig. Currently it's 64x1x1.
exactly what I did when I hit that snag
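A rough sketch of the 3D-to-1D trick, assuming a cube-shaped dispatch and a per-dimension limit passed in as a parameter (the 65535 default is an assumption here; check the device limit). The function names are made up for illustration. The shader side would compute the same flat id and early-exit when it is past the real workgroup count.

```python
# Split a large 1D workgroup count into a cube-shaped 3D dispatch, then
# flatten a 3D workgroup id back to 1D. Names and default limit are assumed.

def cube_dispatch(total_workgroups, max_per_dim=65535):
    """Smallest cube (s, s, s) with s**3 >= total_workgroups."""
    side = max(1, round(total_workgroups ** (1.0 / 3.0)))
    while side ** 3 < total_workgroups:  # fix up float rounding from below
        side += 1
    while side > 1 and (side - 1) ** 3 >= total_workgroups:  # and from above
        side -= 1
    if side > max_per_dim:
        raise ValueError("too many workgroups even for a 3D dispatch")
    return (side, side, side)

def flatten_id(workgroup_id, num_workgroups):
    """Remap a 3D workgroup id to the 1D index the shader actually wants."""
    x, y, z = workgroup_id
    nx, ny, _ = num_workgroups
    return x + nx * (y + ny * z)
```

For the 103039 workgroups in the error above this gives a 47x47x47 dispatch, with 784 spare workgroups that just early-exit.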
Why must I though :/. Why can't drivers be smart.
because actual magic doesn't exist and "magic" means heuristics and heuristics suck
im also confused about cmake 🙂
one cmakelists does if (RETINA_ENABLE_PROFILER) the other one does if (${RETINA_ENABLE_PROFILER})
i cant get it to work either
i stole the profiler one
no matter what i set or option with ON and with or without CACHE and or CACHE BOOL it wont pick it up
unless i really have to yoink build/ physically and not just Delete Cache & Reconfigure
yeah in this context it’s the same but it does have a different meaning
2^16 probably not 100% random
Mister mister @wicked notch please show your shadows
Thank you 
clang is a requirement
you can install clang through the visual studio installer if you want to use VS
Yes. Here's a breakdown of problems:
- The two green vkCmdCopyBuffer()s (4ms each) are staging buffer copies. I'll probably want to use ReBAR and directly map buffers
- The write_index_buffer pass (10.5ms each, 2 per view) are slooooow and have horrendous occupancy. Couple of things I can try:
  - Have culling pass write out a list of all visible clusters, indirect draw to spawn write_index_buffer pass workgroups, instead of spawning 1 wg per cluster and just using early exit if they were culled
  - Try to merge it with the culling pass
  - Add software raster and hope it reduces the need for hardware raster
  - Add mesh shaders to wgpu
- Raster (2 per view, 3.3/4.3 for main/shadow view in the first one, and then second is basically free because occlusion culling)
  - I'm 83% PES+VPC throughput, which I think means primitive assembler limited? Nothing I can do to fix this really. Software raster + mesh shaders might help, again.
I don't remember our method for writing meshlet index buffers to be that slow tbh
idk how it came to be like this
This is the shader https://github.com/bevyengine/bevy/blob/0ebd414dbc13ef77b84627ca5d0e82b72cd50262/crates/bevy_pbr/src/meshlet/write_index_buffer.wgsl. 1 workgroup per instanced cluster in the entire scene regardless of culling.
Hold on had to go walk somewhere. Give me 15m to get back to my PC.
I'm suspicious that the spawn 1 wg per cluster and early exit if culled is too slow
Because notice both are slow, first and second pass
Even though the second pass should be doing nothing
pes + vpc probably means hw culling is the limiting factor
like backface and frustum
also im pretty sure nanite uses no index buffer
writing it is just too slow
what is the write index buffer doing then
I swear I turned off backface culling in my pipeline descriptor... I'll have to check
In my testing using a whole indirect draw args per cluster was wayyy worse. Maybe it's better when almost all of your clusters are software rasterized though...
yes it is
idk what your index buffer is
writing out the meshlets should be at most 1/10th of the draw time
1 u32 (cluster id + triangle id) per triangle, so that the vertex shader knows what to draw
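A small sketch of what such a packed entry could look like. The 6-bit split (at most 64 triangles per cluster) and the helper names are assumptions for illustration; the message above only says the two ids share one u32.

```python
# Pack a cluster id and a triangle id into one 32-bit value, one entry per
# triangle. The 6-bit triangle field is an assumed layout.

TRIANGLE_BITS = 6  # assumption: at most 64 triangles per cluster
TRIANGLE_MASK = (1 << TRIANGLE_BITS) - 1

def pack(cluster_id, triangle_id):
    assert 0 <= triangle_id <= TRIANGLE_MASK
    assert 0 <= cluster_id < (1 << (32 - TRIANGLE_BITS))
    return (cluster_id << TRIANGLE_BITS) | triangle_id

def unpack(packed):
    return packed >> TRIANGLE_BITS, packed & TRIANGLE_MASK

def decode_vertex(vertex_index, entries):
    """What a vertex invocation would do: 3 consecutive vertices share an entry."""
    cluster_id, triangle_id = unpack(entries[vertex_index // 3])
    return cluster_id, triangle_id, vertex_index % 3  # last value is the corner
```

With this layout the draw's vertex count is simply three times the number of entries written.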
so your index buffer is really a primitive buffer i see
hmmm intresting that thats so slow
yes
ah btw
remove the wg barrier
the atomic will be optimized by the hw if you are on nv
the barrier will most likely make it slower
Will do when I'm home, 5m
wot
how does that work
amd and nv both have heavy hw optimizations once you have extreme contention
so if all threads in a warp use the same address
it will catch that and do one atomic + warp prefix sum instead
instead of warp size n atomics
as soon as one thread diverges you die
in that case there is only one thread in the wg doing the atomic tho
yea i believe this will be slower
the gain is lower than the cost of the barrier
so what im suggesting is doing the atomic in all threads and let the hw catch it and make it fast
this way you don't pay for the barrier
maybe that will make no difference tho
it can make a big difference
but its hit or miss
what i did is write out a bitmask per cluster instead and early out in the vertex shader instead
this way you write way less
that was fast af
so what you're saying is that the driver just does this
```glsl
const uint local_offset = subgroupExclusiveAdd(meshlet_primitive_count);
uint global_offset = 0;
if (subgroupElect()) {
    global_offset = atomicAdd(..., meshlet_primitive_count * 3) / 3;
}
```
shrimple as dat
hardware accelerated subgroup memes
yes
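A CPU-side sketch of the aggregation being described, comparing one atomic per lane with one atomic per subgroup. It only models the arithmetic. Whether a given driver or GPU actually does this is the claim made in the chat, not something this code shows, and the function names are made up. In this sketch the elected lane adds the subgroup total and the returned base is shared with every lane.

```python
# Model of subgroup-aggregated allocation: exclusive scan over the lanes,
# a single atomic add of the total, then base + local offset per lane.

def exclusive_scan(values):
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out, running  # per-lane offsets, subgroup total

def allocate_per_lane(counter, counts):
    """Baseline: every lane issues its own atomic add."""
    offsets = []
    for c in counts:
        offsets.append(counter)  # an atomic add returns the old value
        counter += c
    return counter, offsets, len(counts)

def allocate_aggregated(counter, counts):
    """One atomic add for the whole subgroup, offsets rebuilt from the scan."""
    local, total = exclusive_scan(counts)
    base = counter  # value the single atomic add would return
    counter += total
    return counter, [base + o for o in local], 1
```

Both paths hand out the same slices here; the aggregated one just touches the counter once instead of once per lane.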
its the same with branches btw
if all threads have same condition only one path is taken
there are more instructions like this where the hw will check if all threads use the same x to go faster
sc?
the hardware also does stuff
shader compiler
imagine trusting any compiler for glsl
it's not a glsl compiler 
so it can do it even if its unknown
Ok I'm back, let me see
This is in reference to what?
the mask masks tris in a meshlet
so one bit is one triangle
i had a limit of up to 64 tris per cluster so it was just 2 uints
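A rough sketch of the mask approach as described: two 32-bit words per cluster, one bit per triangle, and a fixed-size draw where each vertex invocation looks up its own bit. The helper names and the exact vertex-to-triangle mapping are assumptions for illustration.

```python
# Per-cluster triangle visibility mask stored as two 32-bit words, plus the
# lookup a vertex invocation would do in a fixed 64-triangles-per-cluster draw.

MAX_TRIS = 64  # limit mentioned in the chat

def build_mask(visible_triangles):
    lo = hi = 0
    for t in visible_triangles:
        assert 0 <= t < MAX_TRIS
        if t < 32:
            lo |= 1 << t
        else:
            hi |= 1 << (t - 32)
    return lo, hi

def triangle_visible(mask, t):
    lo, hi = mask
    word = lo if t < 32 else hi
    return ((word >> (t & 31)) & 1) == 1

def locate(vertex_index):
    """Map a vertex index to (cluster slot, triangle, corner), assuming the
    draw always issues MAX_TRIS * 3 vertices per cluster."""
    triangle_global, corner = divmod(vertex_index, 3)
    cluster_slot, triangle = divmod(triangle_global, MAX_TRIS)
    return cluster_slot, triangle, corner
```

A vertex whose triangle bit is clear would emit a degenerate or NaN position so the triangle is discarded. The compute pass writes 8 bytes per cluster instead of 4 bytes per triangle, at the cost of the extra vertex invocations.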
Here?
// Reserve space in the buffer for this meshlet's triangles, and broadcast the start of that slice to all threads
if triangle_id == 0u {
    draw_index_buffer_start_workgroup = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u);
    draw_index_buffer_start_workgroup /= 3u;
}
workgroupBarrier();
How is removing it safe? Other threads may read the value before thread 0 writes it no?
did you try the prefix sum method btw
I wanted to try it but I then just hopped on the mesh shader train 
and the tri mask was the best?
on a 4080
well
ye I would imagine blasting 100 million vertices for a 4080 is just another tuesday 
well, on a 1080ti it was kinda the same
So instead of writing a buffer of cluster|triangle IDs so each vertex invocation knows what data to fetch, you wrote out a list of just cluster IDs (1 per cluster), and then hardcoded the draw size to 64 * clusters? And then each vertex can find its cluster via vertex_id % 64, and then you just output NaN for excess triangles?
i should have used nan as amd fast-paths those
yes
I tried that. It was slow...
what was slow
The draws. All the extra vertex invocations were expensive.
yea but now your compute is slow
hrmm maybe it's worth a second test...



