#Iris - A Journey through OpenGL and beyond to learn Graphics
1 messages · Page 20 of 1
Anyone know where I can upload a 35mb trace?
Discord
It depends on the server
Idk, when I tried this several months ago it was slower. Let me try again
But also, can you explain the workgroup barrier omission first?
i think my draw was around 20% slower but most of my meshlets were relatively full
but the compute part to write was way faster
He is trying to make you invoke ub
just to an atomic per thread
They can't, I'd be growing the count too much. It's important that it's only incremented once.
I do have acess to subgroups though btw, as of this week
the hw will make sure it's only once
instead of vertex count, each thread adds one
oh
btw is it normal that your vkCmdDispatch is 232x232x232? 
kekek
I ran out of room in a 1d dispatch so I made it 3d
since it's 1 wg per cluster
huh weren't you doing the two pass kinda thing?
which is muuuch better since you avoid that early out
you also avoid spawning ludicrous amounts of compute invocations 
I am. Culling and draw index buffers are seperate.
No, It's not an indirect draw
It's on my list to write a compact buffer of surviving cluster IDs and make it an indirect dispatch, though
Or just combine the two passes if I can
maybe do some sort of subgroup/workgroup prefix sum to collacese the writes in the culling shader
sooo many possible options, tedious to test every one
// Reserve space in the buffer for this meshlet's triangles
draw_index_buffer_start_workgroup = atomicAdd(&draw_indirect_args.vertex_count, 1u);
draw_index_buffer_start_workgroup /= 3u; // ????
// Each thread writes one triangle of the meshlet to the buffer slice reserved for the meshlet
let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
let meshlet = meshlets[meshlet_id];
if triangle_id < meshlet.triangle_count {
draw_index_buffer[draw_index_buffer_start_workgroup + triangle_id] = (cluster_id << 8u) | triangle_id;
}
This can't be sane, right? So atomicAdd() is going to return the same value for every thread??
uhh, the value before all the adds in the workgroup, or after?
I think you can just write this like so:
do
you just use the offset every thread gets locally
the atomic add retuens a different value for each one
const bool is_valid_primitive = gl_LocalInvocationID.x < meshlet.primitive_count;
const uint local_offset = subgroupExclusiveAdd(uint(is_valid_primitive));
uint global_offset = 0;
if (subgroupElect()) {
global_offset = atomicAdd(..., (local_offset + 1) * 3);
}
buffer[global_offset + local_offset] = ...;```
I think this is correct?
you can remove the local offset and the subgroup elect as well
oh, huh. Don't actually enforce consecutive locations in the buffer are for the same cluster, ok
right
but yea you can do it like lvstri showed for piece of mind
lvstri your forgot a subgeoup broadcast first
🫨
subgeoup majonaise is my favourite dressing
@compute
@workgroup_size(64, 1, 1) // 64 threads per workgroup, 1 workgroup per cluster, 1 thread per triangle
fn write_index_buffer(@builtin(workgroup_id) workgroup_id: vec3<u32>, @builtin(num_workgroups) num_workgroups: vec3<u32>, @builtin(local_invocation_index) triangle_id: u32) {
// Calculate the cluster ID for this workgroup
let cluster_id = dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
if cluster_id >= arrayLength(&meshlet_thread_meshlet_ids) { return; }
// If the cluster was culled or already drawn in the first pass, then we don't need to draw it
if !meshlet_should_draw(cluster_id) { return; }
// Exit this thread if the triangle slot is empty in the meshlet
let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
let meshlet = meshlets[meshlet_id];
if triangle_id >= meshlet.triangle_count { return; }
// Write cluster ID + triangle ID
let i = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u) / 3u;
draw_index_buffer[i] = (cluster_id << 8u) | triangle_id;
}
Yeah?
What's the subgroup stuff for in yours?
uh oh, something is broken
oh whoops
Should be let i = atomicAdd(&draw_indirect_args.vertex_count, 3u) / 3u;
Ok I mean it renders fine, but it appears to be exactly the same perf...
when you add 3?
Code is a bit simpler though so that's cool atl
yea then its something else slowing it down
Why am I only at 20% occupancy :/
send spirv I guess 
maybe wgsl compiler is generating absolute retard code
but I doubt
i am downloading nsight nr
rn
i only have my laptop as im on a worktrip
also l´slow hotel internet
NSight also says this I noticed
Is that just because I have perfect occlusion culling lol
And it think it means culling is doing nothing?
What does it mean?
?
in the shader? nope
show frag shader
nsight is expecting a 3.3x speedup with early z enabled damn
It's for the shadow view only, the main view dosen't have this issue
ah you write frag deprth
Right, it's rasterizing depth
Eventually when wgpu gets image atomics, I'm just going to use that
Yeah this bit:
#ifdef DEPTH_CLAMP_ORTHO
@fragment
fn fragment(vertex_output: VertexOutput) -> @builtin(frag_depth) f32 {
return vertex_output.unclamped_clip_depth;
}
#endif
you could just enable depth clamp in your pipeline state
This is from somewhere else in bevy, let me figure out why that has to be a thing
or write a color attachment if its special
no its not
this will write the depth buffer
it will disable early z
as you could change the value to be smaller then it is when early z runs invalidating it
bevy going hard i see i see
i think you can just make the frag shader empty
return nothing
depth clamping is necesary for some shadow mapping techniques
but it's an option in your rasterizer state that you can just enable, there's no reason to do it yourself
Rtx 3080
am I dumb, I don't see it?
a simple, no-frills, command-line driven
pastebin service powered by the Rocket web framework.
vkCmdSetDepthClampEnableEXT
I don't think wgpu exposes it xD
bruh
When you set DepthClipEnable to FALSE, the hardware skips the z clipping (that is, the last step in the preceding algorithm). However, the hardware still performs the "0 < w" clipping. When z clipping is disabled, improper depth ordering at the pixel level might result. However, when z clipping is disabled, stencil shadow implementations are simplified. In other words, you can avoid complex special-case handling for geometry that goes beyond the back clipping plane
how does anyone unironically use D3D12
you will loose a ton of perf if you can not do early z
but you can force it on in most apis
I'll add it to my list of "things that will be faster when X wgpu issue is closed" :/
LGPU
ms is meming me or vk spec is
I think MS is meming
I don't see vk spec talk about w, it seems to be operating after / w
so it's talking about clamping z/w
what this tells me is that we don't have enough CTS in this area
in vk
or I'm misunderstanding something
This shouldn't matter once I add software raster anyways, as I'll be directly blitting the depth in the fragment shader
Ok, so, raster is not going to get any faster anytime soon until I get software raster or mesh shaders
Back to writing out an index buffer
did you profile your shader yet
So it osunds like on directx we can remove the fragment shader workaround?
idk what it sounds like
jasmin could you try increasing the wg size for me in the write index buffer shader?
btw you hsoul default to 128 or 256 size wgs
if your intent is to clamp things no I don't believe you can
ok this makes more sense
Idk I'm lost by it lol
the 3080 cant reach peak occupancy for workgroups smaller then 96 threads but with 128 it can reach max launch speeds for warps as well
What about AMD?
ok, I'll switch culling to 128, that's easy
but for things like this that finishes fast its better to have larger groups
well culling is divergent
so maybe I'll keep it 64 lol
with divergent i mean massive differences in runtime between threads like 10x
oh ok
for example compute raytracing for ssr or ssao
thats mega divergent
culling should be pretty cozy
Ah, thank you. Yay for learning.
Ok so, profile write_index_buffer, then try increasing the thread count
somehow the compute job acts as tho you have workgroups the size of 32 or so
in nsight it looks like that at least
No, 1 cluster per thread, so some may get to occlusionc ulling, others may get frustum or lod culled and exit before that
I mean the write index buffer
my bad
culling speed is not an issue is it
even if you diverge like that it's pretty minor
yeah cullign speed is like 0.3ms on this insane cluster count
it's nothing
raster + write index buffer + per frame data uploads are the issue (ignoring shading, currently broken, and inefficent anyways :P)
It takes a good 5m per trace because there's so much data lol, give me a few mins 😅
I'll have to switch to 1 frame per trace, even 2 takes forever
oki
btw @primal shadow its ub to return from a shader before entering barriers
all threads must enter the same barriers
whuh
or does wgpu do some magic?
I assume wgpu is protecting me here
Cause I don't get any errors from them
Actually no, meshlets don't use barriers anymore
yea it will work
Except for downsample depth hiz
most of the time
when I just copied from SPD
but then you wll die randomly
can wgpu even detech such cases?
Idk how to export the shader, so let me upload the trace again ig
yea that would be sus as then shading langs should do that also
this is with 64 wg still
no
ah i see
webgpu also leaves that to be ub
good
no safe in gpu land
safety memes don't work in gpu land
i connected to gabes neuron to help
checkmate
I wish webgpu was more like wgpu_hal sometimes :/
I am 4mb over the limit even when compressed ahhh
nvm I was overthinking it
anyways this is the only expensive part of the shader
but ye webgpu does not protect you from ub
you will get ub
it's extremely infectious
and a little here
ok i have a suspicion
reminder that lgsb means waiting for global memory op
does it run with a workgroup size of 32?
and loops?
can zou see that in the spirv?
reads or writes?
what i dont get are the unused warp slots
long scoreboard (waiting on a read) doesn't have anything to do with why the workgroup is size 32
either
it can be texture samples, loads, stores, atomics, whatever, but operating on global memory (or scratch/private/stack/whatever too)
(presumably is size 32 with a loop, because there's no other explanation)
jasmine in the spv at the top can you give us the workgroup sizes?
it dosen't say?
11, OpExecutionMode %write_index_buffer LocalSize 64 1 1 ; 0x00000098,,,,,0,,,
64x1x1
ok
i suspect its launch bound, so the gp cant launch new workgroups fast enough and the warps retire so fast that the gpu cant go full saturation
so making the wg bigger should help a lot then
nsight won't tell me that?
at least to increase occupancy
i think it did say that at some point but they removedthe metric
they changed it a lot last year very annoying
probably because I spawn a bunch of dummy workgroups that early exit as soon as it checks the cluster is culled...
Changing it to an indirect dispatch over surviving meshlets only should help I believe
or ideally just combine the pass into the culling pass
if only

Write index buffer is wayyyy too expensive. Ideas/options:
* Increase WG thread size
* Write out a buffer of surviving cluster IDs, using an indirect dispatch
* Combine culling and writing index buffer, somehow
* Use multi draw indirect, 1 IndirectDrawArgs per cluster
* Use a fixed size 64 * surviving_cluster_count draw, write out a buffer of surviving cluster IDs, and then have the vertex shader write NaN for excess invocations
launch bound makes sense in that case I believe. I'd try looping to handle multiple things in 1 compute thread
but use 128 wg size either way
horrid ass 30 series cards
does 40xx improve things?
all other nv and amd gpus
loop, or just use bigger WGs?
have a ratio of 2:1 for warp to workgropup slots on each sm
only the 30 series has a ratio of 3:1
they did that so that they can have more registers for raytracing for each thread but it turned out to bea shit tradeoff
both
dumb nv
every nv generation except 30xx does not have this horrid cripple ass limitation
I daily drive a 3070 🤡 🔫
to be clear tho, having larger wg sizes for these workloads is better on all gpus
me when I have a 3070 too
its jsut that the 30 series can never ever reach full occcupancy at all with 64
Write surviving cluster buffer + indirect dispatch to get rid of "spawn 1 wg per cluster, just exist immediatly if culled" is also an option
Hard to say 😅
128 workgroups
i will check your shaders tomorrow
I use 256
just cope
nice
I daily drive intel hd graphics
this bad boy has 2 MMUs
does your gpu have 2 MMUs?
that's right, it only has one
honestly idk how the fuck queue bind performance still sucks after 13 years first gpus shipped with this feature
on windows idk what's the problem but elsewhere certain drivers oversync to hell
its very sad
hazah, write_index_buffer is now ~5.4ms using wg=128, 2x as fast!
....still very slow 😅
256 time now
I did the math here correct, right?
var cluster_id = 4u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
let triangle_id = local_invocation_index % 64u;
cluster_id += local_invocation_index / 64u;
And is the compiler smart enough to combine the % and /?
4 clusters per workgroup
64 triangles per cluster
you could easily just do >> 6 and & 0x3f if you don't trust your compiler
yes the compiler will find that
the compiler will ABSOLUTELY do that for you
so long as you're dividing by a compile-time constant
if its not you die by stroke
if it doesn't do it for you, abandon wgsl. Don't care.
(the divisor is not constant, you will be asleeped by the compiler forever)
driver compiler will do it
the other day I heard about a d3d11 game doing 1D dispatches and then in the shader it was translating that to 2D coordinate by doing stuff like % thingy.GetDimensions().x etc
you could do 64x4x1
then the math will be simpler
oh, true lol
where reactions
they are literally doing int div by a non-constant divisor
they are amazing i will hug them
potrick is MS spy
hug of death
Slightly faster with 256 WGs, 4.3ms now
wasted invocations and mem
probably memory latency
wated invocations are ok
as long as its a good tradeoff compared to other options
Memory latency, where? The atomic write? the looking up meshlet data?
- Multi draw indirect is too slow, bottlenecks at command processing, single draw indirect is neccesary
- Using fixed-size increments of 64 vertices and outputting NaN for excess triangles slows the draws down by a decent amount, but it's an option
- Somehow combining culling and write index buffer shaders are an option, but difficult to get working well, as each thread is writing a different amount of vertices
atomic and writing yea
mesh shaders
I don't have atm :/
I think my best bet is either combine the shaders if I can, or make write_index_buffer an indirect dispatch
garbage nonsense limitations don't allow your shaders to be limited just because Apple exists
or cope that that's why it's so slow
I need to use wgpu regardless as the rest of the engine uses it
yeah I'm just raging
afaiu you can just throw wgpu out of bevy?
just not use it
you mean you're working on the engine
Yes. I mean I could use vulkan directly, but I would lose way too much, and it'd have very little chance of getting upstreamed into bevy,
I don't get how people aren't more upset about the limitations on web
like you are trying to make a nanite-like renderer
you absolutely need to utilize the hardware as much as possible for that to even be viable
mesh shader extensions were added to Vulkan SIX years ago
that's 2018. Vulkan was initially released in 2016... Mesh shaders are almost as old as Vulkan is 💀
but yeah web has no support 😐
patrick no
🤓 👆 the cross platform extension came 2022
*for vulkan, yeah
web is always needs to support all the bad old hw
wgpu isn't restricted to web, they have some exts in there not available on "the web"
but yeah it seems like it's another layer of bikeshed
yes
Fixed large bug, down to this now
mhmm, write_index_buffer is still chunky due to all the extra dead workgroups
Very well done, good job!
and the 8ms 2x copy at the start is due to lack of ReBAR support
nope
Right, but now the culling pass needs to output a buffer of surviving clusters 🙂
and aotmic increment an indirect dispatch
I'm going to be sneaky and reuse the index/triangle/primitive/thing buffer for the surviving cluster buffer though
Maybe, idk
👍
Is there a way I can avoid the large data upload at the start of the frame? For each cluster in the scene (there might be millions), upload 2 u32s (meshlet ID, and instance ID). Is there a way I can make this persistent, maybe? Idk.
Instance ID tells the cluster what instance/entity transform to read
Meshlet ID tells the cluster what meshlet asset it's an instance of
No matter what I do, I'm going to need more buffers 😭
Buffers for telling software raster what cluster to use, and same for hardware now
I mean 4 bytes per cluster is not much, but dynamically managing the buffer size is sucky
thank god im not alone 🙂 +1 for all the things you said
its a lot actually
its way too big considering pcie speeds
you should have a list of meshlets per mesh, a list of meshes per entity and a list of entities.
Then expand from entity, to meshlist to meshes to meshlets
uploading 8 bytes per meshlet a frame will never scale its way too much data
btw it looks like every single command has its own command buffer for some reason
didnt you have that already and nanite like lod selection? maybe im confusing projects
wait we are in lvstris channel the whole time 
they named a user after DeccerCubes
they named a cartoon after a shadow artefact
Wgpu does that, apparently... Not sure how much perf that costs me, but

I know :(. But I dispatch basically over a flat list of meshlets. How else does each culling thread know what to operate on? I can't figure out how to avoid it...
for you its probably not too much
Yeah I have runtime nanite like lod selection
there are multiple ways to do this
the simplest is prefix sum + binary search
in this gist i link example code for binary search + prefix sum
Ok lol I did think of that actually, but it seemed like it would be slow to have every thread of every meshlet pass do a whole binary search 😅.
Oh nice this describes exactly my problem lol. Let me give this a read.
every single dispatch or something
it has some oberhead yes but its ok
its much better then uploading all from cou 
wgpu sync moment
Yeah it's inefficient tracking. There's an issue open for redoing compute shader tracking at some point to be more fine grained.
wgpu also spamms hash maps
the correct answer is not tracking
ye
I read the whole thing, didn't quite understand the last solution 🤔. Mind explaining it again?
The power of 2 argument buffers
one sec
the idea is
for example
meshlet cull shader feeds meshlet draw shader
i will answer soon
No rush, thanks
sorry other way
so mesh cull shader, each thread does one mesh
expands to meshlet cull shader
each thread does a meshlet
the eypansion between them is asymmetrical
as meshes have differing amounts of meshlets
the idea now is
to use a property of binary numbers
in the mesh culk you take the bumber of meshlets, and write one of 32 buffers, for each bit set in the meshlet count number
this way, instead of writing n values where n is meshlet count
you write up to 32 times, one time per set bit
then these 32 buffers contain a aegument each
deoending on their "power" each elemnt desceibes 2^n work items
so then
you do 32 dispatch indirect
for each arg buffer
for the meshlet cull shader
and each thread can simply bitshift their id by the power
to get the argument
its a little involved
hard to explain fir me
Uhh so let's say you have 10 meshlets
And it gets culled down to 4
Can you give a practical example of what it would look like?
yea
id you have a mesh with 12 meshlets. It will first cull the mesh if it survivies it will write arguments to cull 12 mehslets
thats 1100 binary
so it will write the 3rd and 4th buffer
what if you have 2 meshes with 12 meshlets?
then
Do you need 32 buffers per mesh?
append how?
nah nw, I'll keep thinking on it. Thanks!
Ok
so an example would be
Surviving Meshes meshlet counts:
mesh0: 6 meshlets
mesh1: 3 meshlets
mesh2: 1 meshlet
mesh3: 5 meshlets
mesh4: 7 meshlets
Each bit in the meshlet count maps to an arg buffer:
Surviving Meshes meshlet counts: | arg buffers appended:
mesh0: 6 meshlets | 00000110
mesh1: 3 meshlets | 00000011
mesh2: 1 meshlet | 00000001
mesh3: 5 meshlets | 00000101
mesh4: 7 meshlets | 00000111
in this example we have 8 arg buffers
so the arg buffers will contain
arg buffer 0:
[mesh: mesh1, meshlet offset: 0] [mesh2, 0] [mesh3, 0]
arg buffer 1:
[mesh0, 0] [mesh1, offset: 1]
arg buffer 2:
[mesh0, 2] [mesh3, 1]
each argument is a pair of mesh index and meshlet offset into the mesh
so when for example looking at mesh0, you will append to arg buffer 1 and arg buffer 2
into 1 it will append mesh0, offset 0 and into arg buffer 2 it will append mesh0, offset 1
you using metal or why are you calling these argument buffers
arguments is the naming convention for some renderers for buffers that determine what threads in indirect dispatches map to
So max of 8 meshlets per mesh, so 1 + ceil(log2(8)) = 4 argument buffers?
Then after culling, if there are N meshlets left for a mesh, append this mesh to the argument buffer X for all set bits X in N?
And ok, you append the mesh ID to each of those argument buffers
But what are the meshlet offsets?
you append in the mesh culling not meshlet culling
these arg buffers are then read by meshlet culling
all meshlets of a mesh go into arg buffers
Right, I don't actually do mesh culling, I want to use this for meshlets culling -> triangles, but same concept so anyways
technically not needed
it tells the shader at what point to start working on the mehslet list
so for the example of mesh3:
first arg is in arg buffer 1: [mesh3, 0].
second arg is in arg buffer 2: [mesh3, 2]
in the meshlet cull, the threads read their dispatches power, so arg buffer 1 has power 2, arg buffer 2 has power 4.
they div their thread count by power and mod it by power.
the dived value is the arg index, the moded value is the offset inside the arg.
the first arg isin buffer 1
starting at 0
Buffer 0 you mean?
What does starting at 0 mean though
I'm lost on what the offsets are for
Also, won't this ruin cache efficiency, if you're spreading the meshlets for a single mesh out across workgroups?
starting at the 0th meshlet of mesh3
two threads will cull in arg buffer 1 dispatch
then in the arg buffer 2 dispatch, 4 threads will work on the rest of the meshlets
but they need to start at meshlet index 2
as the first arg buffer dispath alreadz worked on the first two meshlets
technically
you can also get that value by masking the mehslet count lower then the powers index
i will improve the example
Please do, I am lost 😅
yea its hard to understand
a lot of bit twisting and indirections
i thing better examples
will make it clear
"I don't have time" says bro while posting a detailed elaborate graph with colors and explanations
ooooh ok this makes sense now!
Each item in an argbuffer represents 2^i units of work
hence the mesh_id of where it came from, and the meshlet_offset
So given the 13 meshlets in your example
it would be:
argbuf0: offset=0..=0
argbuf2: 1..=4
argbuf3: 5..=12
except you only need to store 0, 1, 5
ok two questions then:
- How is this cache efficient, if you split the meshlets for a single mesh between many argbuffers?
- Isin't this very lopsided work-wise? Each argbuffer has exponentially more args to process. Are you supposed to then take that mapping, and redistribute it somehow across the threads? Or is it fine
Also I need to go figure out how many threads you spawn to process each argbuffer, and how it'll work with multiple meshes
Regarding the second point, I believe you'd just launch the appropriate number of threads for each buffer, so that each thread gets one
First question I don't understand, I doubt cache comes into play on a meshlet granularity
meshlets need to lookup their mesh's transform and such
I also want to try and use this for meshlet->triangle writing
for small numbers its not cache efficient
it will also make the atomics much slower
as you need to do more of them
but for large numbers the bandwidth and divergence benefits win
i didnt test much but meshlet count over like 8-32 were better already with the po2 buffers
any if you have larger meshes its muuuhc better
I dont get point 2
how would it be lopsided
The idea is to launch more threads for bigger arg buffers
It's 1 thread per unit of work, of which there is 2^i units of work per arg
you launch (argcnt * 2^(arg buff index)) threads per arg buffer
so you waste nearly 0 threads
the po2 solution is by far the fastest i could find so far for this workload btw
what could be much better are workgraphs
they solve this on an api level
Is cache locality that big of a deal on GPUs anyways btw? You have way more latency hiding compared to CPU, I don't really hear much about it
Really, hmmm
Interesting yeh, I've not really been paying much attention to cache when writing shaders
Other than shared mem I guess
its ver intrecate
There is so much to think about
it can be better to shit on cache locality if it means you safe a ton of mem bandidth like in this case
but you can do a lot of warp level optimizations btw @primal shadow that make the cache locality pretty good again fo the writes
Does each arg buffer have to be a seperate dispatch?
yes
haha I'm not getting those for ages 😅
64 dispatches (64 triangles per meshlet) seems... expensive?
That reminds me btw @wicked notch if you make your pages 128 your mark visible pages pass (if you scalarize) will be > double the speed
especially because I have no way of overlapping them...
no wgpu will insert barriers 😭
i really hope wgpu is not that crippled
it's already blazing fast for me

each dispatch is treated as a "usage scope", and will ocmplete in order via barriers between them :/
Was like 1.2 ms without scalarization for me
about 34 microseconds
Wtf
its also a dispatch per arg buffer so 24 or 32
not 64
How
I really do nothing special there
show code
Do you do atomics or?
64 triangles per meshlets = 64 arg buffers, no?
I do no atomics
oh if you can only go up to 64 you only need 6 dispatches
no
the arg buffers double the work each time
Nevermind then
log2(64) = 6*
ah yeah with caching it would be a problem 
6
6
shhh
keko
6
ok maybe 6 is fine barrier wise then
You need atomics and perf dies ☹️
atomics scalarization makes it warp speed again
Young man does image store with no atomics with multiple threads writing the same spot
Is that legal?
And watch your perf plummet
yes 
Btw how do you allocate? You do it after?
ye
Loop through VSM and find all visible this frame and allocate
i will get the hammer
ok I will fix
how does that make sense
is it jsut marking it?
like or iing a 1 on?
They all write the same ye
yes
then atomic store might actually just be the same speed
on nv hw
could be that that stores of that size are inherently atomic
relaxed atomic writes on x86 are also completely free
yea
it's a 2KiB image it's fiine
the tap is shit but if its so small its probably ok
also the scalarization would still help btw
As Jaker would say, it doesn't matter for now
less contention still
real
its fine like so
tido crashes on amd shader compiler crashes
goofy aah drivers
I just noticed the pass spiking to 130us when I reduced the page size to 64
why 64
Was just experimenting with values
You theoretically get better culling
eh
And tighter allocation
I also tried 64 but for a different reason
I wanted to see whether smaller pages could help with projective aliasing
You have 64 now
(they don't)
It doesn't matter what size the page is
For the resulting shadow quality
The information density is the same
ye I realized that later 
You get tighter allocation and culling
btw
did you figure out how to reduce the VSM
because there's a resolution difference between the VSM itself and the virtual resolution
ye
I mean you just make MIPS of the memory block and reduce each page there (the mapping is preserved 64 wide page become 32 wide in MIP 1) and once you reach max MIP (each page is 1 Texel) you use that to build the virtual HIZ on top (starting at page table resolution and going again Down to a single texel)
That is my current working plan for implementing that
hmm I see
but these are future plans
I am stuck on fixing tido bugs and dynamic object shadows
I wanted to generate the invalidated pages mask on the GPU but the geo pipeline we have in Tido is hellish (and I still don't really understand it) so I'll prob cope with CPU approach
So uhh, if I just remove the write_index_buffer pass entirely, and go back to doing it from the culling shader itself, its much faster...
if meshlet_visible {
let meshlet = meshlets[meshlet_id];
let start = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u) / 3u;
let base = cluster_id << 8u;
for (var triangle_id = 0u; triangle_id < 64u; triangle_id++) {
if triangle_id >= meshlet.triangle_count { return; }
draw_triangle_buffer[start + triangle_id] = base | triangle_id;
}
}
probably cause nearly all threads were dead in the write index buffer pass
for writing the indices i think it might be better to just loop btw
cause the numbers might be too low and similar to get any gains from po2 arg buffers
yeah I think it would be problematic
I've gotten it down to a very very reasonable amount of time 🙂 (minus the data uploads, stupid data uploads)
Ok, idea
We already have a list of instances
Run a compute shader to write out the per-cluster data directly
I could also slap on per instance culling in the same shader
you should probably look at othe rpeojects for this as well
most of this has been done many times already
Occlusion culling was broken (accidently bound 1 mip, instead of every mip), fixed it. Still some overdraw, but that's always been the case. I need to tweak the occlusion culling some more.
I think it's time for persistent threads for culling tbh.
Should save a lot of culling time, and also solve my data upload problem. I won't need to upload 8 byters per instance of a meshlet every frame. Instead, I can just upload 1 u32 per mesh instance pointing to the BVH root of that mesh's meshlet tree.
you dont need persistent threads for that
you'll have to make a lockless queue and make a table of device to threadcount for it
very shaky
Well, the alternative is a bunch of dependent dispatches to traverse the tree...
yeah, but perf
So my test scene has 12_463_400 clusters
I upload 8 bytes per cluster each frame
PCIE x16 Gen3 bandwidth is 16 GB/s
12_463_400 * 8 ~= 99.7 MB
At 16 MB/ms, that's ~6.23ms of upload time
NSight shows 92% throughput, and overall ~8.4ms of upload time
So yeah, far too expensive
yea thats horrible
I need to reduce this down to per-instance data upload
you can also make it completely gpu driven
The power2 dispatch trick would work well, and then write out per-cluster data from a compute shader, but I would have to put a barrier between every compute dispatch because of wgpu (unless I hack around it?)
Alternatives include making the clusters form a tree of LODs, and writing out only the LOD tree ID per instance, and then making the shader traverse the tree instead of a flat list
Or uploading a per-instance prefix sum of meshlet counts per instance, and having all 3 shaders (cull, raster, shading from visbuffer) do a binary search to map thread ID -> instance ID :/
How so?
Oh wait, there's another option here
Upload the prefix sum yes, but then have a pass at the start perform the binary search and output flat buffers
do zou do the lod traversal and nanite thing on cpu_
?
i mean you can try persistent threads
but i think that will be pain
No, it's on the GPU, over a flat list of meshlets (you don't have to use a tree)
- On the CPU, compute a prefix sum buffer of meshlet counts per instance, and indices to the start of the meshlet buffer for that instance
- On the GPU, run a new pass once at the start of the frame, 1 thread per instance of a meshlet (same as culling). Each thread (cluster) does a binary search on the prefix sum meshlet count list to find their instance ID, plus meshlet asset ID, and then writes it to the buffers for subsequent passes.
New buffers:
meshlet_mesh_slices = [2 u32's per meshlet_mesh asset containing buffer slice]
instance_meshlet_meshes = [map of instance -> meshlet_mesh_slices id]
meshlet_count_prefix_sum = [scratch buffer of size instance_count]
Shader 1: <perform prefix sum of 1 thread per instance to fill out meshlet_count_prefix_sum>
Shader 2: <follow below of 1 thread per cluster to fill out thread_instance_ids and thread_meshlet_ids>
let cluster_id = global_invocation_id.x;
let instance_id = <binary search in meshlet_count_prefix_sums using cluster_id>;
let meshlet_mesh_id = instance_meshlet_meshes[instance_id];
let (meshlet_start, meshlet_end) = meshlet_mesh_id[meshlet_mesh_id];
let meshlet_id = meshlet_count_prefix_sums[instance_id - 1] - cluster_id; // If instance_id = 0, meshlet_id = cluster_id
meshlet_thread_instance_ids[cluster_id] = cluster_id;
meshlet_thread_meshlet_ids[cluster_id] = meshlet_id;
@wicked notch do your clusters (LOD 0 or otherwise) reference a single set of vertices, or do you have a new set of vertices per cluster?
hrmm, I have a single set of vertices, and just reference them per cluster
everyone seems to do the opposite though
having a single set of vertices is gonna eventually not be compatible with streaming when we add that
mhm
ive never heard of that
every impl i saw is referencing a single set of verts
aside from those that quantize
Saved about 32ms of frame time in my nanite impl 😄
~8.4ms GPU, ~17.5-27.8ms CPU
No more uploading per-cluster data
170 FPS
This is like nanite outpost. Maybe this info will be useful.
Guys at epic games have been really busy making nanite better. Now you can have tessellation, displacement and skeletal animation with nanite.
They got rid of material buffer(abusing depth buffer with OP EQUAL) and now they shade in compute shader. They detect which derivatives(by looking at the dissembly of the vendor blob) are used and then select which pipeline is used by doing this they reduced amount of helper invocations by 6-10%. They now use VRS which 27-45% reduction in shading. Also something to make lpotrick little jealous is that now they use work graphs.
There is nothing(only git commits) on how they did tesselation, displacement a skeletal animations yet.
Yeah I read their presentation (it's very long). There's also (not unreal) http://filmicworlds.com/blog/software-vrs-with-visibility-buffer-rendering/
VRS/compute shading is on my list todo, but not very high priority atm
workgraphs (and mesh shaders) I don't have access to sadly
the other stuff there are papers on, although idk how nanite specifically implements it (I don't read their code)
I'm still a good bit off of nanite 1.0 levels of perf, although I'm making a lot of progress
There's also VSMs, which I haven't implemented at all
Per-view nanite is very expensive... and I want shadow views...
lustri, im really fond of the new auto thingy() -> shizzle {} way of conjuring c++
thanks for ze inzpiration and being a part of my life ❤️ 🙂
Looks like Rust 🤔
it do be
i am curious, why do you prefer it over shizzle thingy() {}
: )
great argument
that way all declarations are neatly aligned
virtual auto Initialize() -> bool;
virtual auto Load() -> bool;
virtual auto Unload() -> void;
virtual auto OnMouseMoved(double mouseX, double mouseY) -> void;
virtual auto OnMouseEntered() -> void;
virtual auto OnMouseLeft() -> void;
virtual auto OnFramebufferResized(int32_t width, int32_t height) -> void;
virtual auto OnFilesDropped(const std::vector<std::string>& fileNames) -> void;
thats nice
tbh yeah, but I don't use trailing return type when the return type is void, bool, or any other four character type
and generally tbh I don't like trailing return types unless it's deductable
i dont know what that means, or rather how it would look like
leaving stuff out feels weird
just as an example from my math thing im writing right now:
constexpr auto operator*(vec<T, M> other) const noexcept {
vec<T, M> ret(T(0));
for (std::size_t i = 0; i < M; ++i)
ret += col(i) * other[i];
return ret;
}
it just knows that the return type is vec<T, M>, and then just auto is enough
ah
same happens with lambdas, generally
thats too next level for me still
although, in c# i do the same, when it comes to lambdas
and in c++ too i think, when it comes to lambdas
eh the other thing is prefixing types
EMyEnum, SMyStruct, im not cool with CMyClass yet 🙂 too deep wounds from the 90s, but its a logical thing to do
good vrs os extremely invasive in the whole renderer
its not something that can be added easily
wait why is the cou time so high
shouldnt it be totally gpu bound if is nanite like?
It used to be, because I was uploading 8 bytes per cluster for like 128 million clusters. It's very little CPU load now.
ah thats the savings number
Yes
went to the retina repo for some inspiration and noticed you managed to get an all-numbers commit hash (at least the short part github shows)
i also prefer this style :^)
it becomes significantly more beneficial when you start doing things with templates
any kind of long name in a return type, or any kind of sfinae or type trait nonsense is completely unreadable when not using trailing return types
- trailing return types for member functions have access to the namespace of thr class they're in, which is great if you're using aliases a lot
Foo::Bar<Foo::Baz> Foo::buh(); will always be inferior to auto Foo::buh() -> Bar<Baz>;
just better in every way
you're saying that normal return types can't? I do that all the time
not if you define them outside the class body, no
I'm on mobile so didn't want to flesh them out :)
: )
monsieur lustri also added rust like typisms ala "self" too, but im not there yet
can't use those bc I'm stuck on C++17 :(
Ok since I'm going to have to do it 4+ times/view I decided to allocate buffers to hold instance+meshlet ID per cluster in the scene, instead of doing the binary search in every pass
Unfortnatently that means another 8 bytes/cluster of memory usage :((
100 million clusters = 800 mb 😦
In the scene, not rendered
And LOD clusters count towards that
~6.7ms/frame, but ~2ms of that is CPU time 😦
I think if I want to save 8bytes/scene_cluster, I'll need to switch to persistent culling
Also save some culling time, probably like ~0.3ms
Culling is already extremely fast for me though
how many bunnies is that
~3100
So I've been coming back to this. Yes, clipping the triangle vertices is wrong. But why not clip the triangle AABB you use for testing pixels covered by the triangle?
Wait, I think Nanite only forces HW rasterization for depth clipping, and not for viewport
that makes more sense
something's cooking lads
pasta water?
Nanite 2.0's oven timer just went off
lustri my frosch, you are using cmake's pchisms
but im a little bamboozled by it
you create a pch library so to speak, and in it you have all the stl headers you use across the retina
yet you include the pch nowhere, but whatever stl header individually again
does "linking with the pch lib" take care of the usual "include <pch.h>"?
I think the pch is automatically included by everything in the cmake way
yes
I think it means those includes are now free
at least that was my understanding how cmakes pchisms should work
they're redundant yes
ah
I'd keep them, imo it makes things more readable
: )
that is/was the conchfusing conchfusion
thank you my froggies
gcc 14.1.1 is out btw
Someone beat me to it.
When will your meshoptimizer fork support IDK_BC5_normal_metallicRoughness 🐸
@wicked notch are you alright
it depends
i heard italy got hit by earfquakes, around napoli
reminds me of the earthquake that hit where I live, and I didn't feel shit
we can make pizza outside of napoli you know
found it
@primal shadow what gltf extension do you mean exactly?
I made my own GLTF extension so I could store meshlet data in it and integrate the new type of geoemtry with our existing scene and asset system
ah you said „bevy meshlet extension“ in showcase and that didnt sound like it was just custom for your engine
Ah no it's a custom extension, bevy is the name of the engine I work on
Awesome work. Starting on this and my goal was to get Bistro itself rendering.
This is what I'm rolling with. Doesn't eat up 5GB of VRAM so it's a start,.
I read https://blog.traverseresearch.nl/creating-a-directed-acyclic-graph-from-a-mesh-1329e57286e5 but am following along https://jglrxavpok.github.io for comparison. They do merging or cleanup of vertices at some point, but still METIS + Meshoptimizer.
If you want another point of comparison, my code is at https://github.com/bevyengine/bevy/blob/main/crates/bevy_pbr/src/meshlet/from_mesh.rs. I haven't implemented vertex welding/merging yet.
I'm also working on writing up a (very long) blog post detailing how everything works in my meshlet renderer so far, that'll come out sometime in the next month probably
Nice. Looked over the earlier PRs. Yours is definitely easiest to follow if nothing but a reminder that my occlusion culling doesn't fully work. 
Seems like an unending problem.
Neither does mine, it seems to over-cull on thin planes 😦
The screenshots of bistro I took were carefully positioned to avoid showing the bug 😅
I'll figure something out. Using WebGPU via Chrome so a combination of JS and Rust/WASM.
Will at least create new bugs.
WebGPU shouldn't really limit you for this. No min sampler mode, but it's not neccesary anyways.
It's just a lot of working implementing everything and maing it fast 😅
I still need to redo a lot of things
It's just the browser really. Need to see if any GPU debuggers work.
Plan to shoehorn it into WebGL where I'm really limited.
do you give up after you can't simplify anymore without trying to reach one node at the root?
Yep
Why do you ask?
because vertex welding is making me sad
Ah. Why so?
What kind of welding do you mean? Just de-duplicating? Or merging nearby vertices?
uvs
merging nearby verts
Yeah UV seams make things really difficult
I'm starting on my blog post to accompany the Bevy 0.14 release notes for the new (experimental) meshlet feature. Maybe it's just my writing style, or the amount of info I have to cover and how much I hate writing, but it feels very dry :(. I'm hoping it's at least informative? Idk, only half done, and the asset processing stuff is the boring part and the section I don't have much to talk about, relative to the (not yet written) shader section, where I did a bunch of things differently from Nanite. https://github.com/JMS55/jms55.github.io/blob/a3c272a4563675c6888cfce6565784af1c552b0c/content/posts/2024_05_10_virtual_geometry_bevy_0_14/index.md.
I would guess that "dryness" will varies per reader. Some people may prefer a straight-to-the-point article, and it might be easier to read for non native speakers.
Taking myself as an example, I try to write things as I would say them, but I tend to prefer focused articles if I want to learn a technique*. Furthermore, english is not my native language but I found your introduction easy to read and understand, and the other parts are straight-to-the-point (that does not mean they are hard to read nor understand!), as you write it in the article "in favor of just describing the algorithm itself". That's a complete valid reason, you already have a lot to write about!
Maybe add images to visualize various concepts?
Images are great at getting the point across and lets you re-reference them later if you need them
Thanks for the feedback. I'm glad it was understandable, I was worried just giving a brief description of things was not enough, but there's so much to write about 😅. Btw your English is great!
Yeah I have a few placeholder TODOs for images, maybe more would be helpful
If the description was too brief, you can always make a follow up article or updates to your article
True
I liked the write-up. I found it quite understandable and didn't mind the "dryness" at all
Now I just gotta wait for the to-do sections hehe
Thanks for the feedback! Working on it in parallel with actually writing new code 😅
@wicked notch John Khronos is reaching out for help https://discord.com/channels/427551838099996672/1151556776379023452
hehe
man they create new shit every week
now new PBR tonemapper
3d commerce bs
anari nonsense
whatever that kamaros thing is
Same, which made me think this was a wikipedia article about John Kronos life achievement...
Terminology and 1 pass of my renderer written about for the blog post - 8 more passes and the future work section left 😅
About the first pass (I don't know much about Bevy's architecture): do you really have to compute the instance <-> meshlet connection every frame?
Can't you "just" compute this mapping once at load and then reuse it?
At load technically yes, but I want to support dynamically adding and removing entities.
Unreal's solution is better though, I'm going to copy that. It's going to be part of the future work section.
Instead of dispatching per cluster, build a BVH over the lod tree, and traverse the tree per instance instead with persistent culling.
You could compute the mapping for the entity when it gets added, right?
Removing entities makes it more difficult. Then you have gaps in the buffer and can no longer map cluster id to instance id just by indexing.
Oh right
Day ? of ??? of writing. Did half a shader, didn't even manage a full pass 😅. Tbf cluster culling is a lot.
@wicked notch @dull oyster I've finished writing the first draft of my frame breakdown. Not published to my blog yet, but here's an early access link: https://github.com/JMS55/jms55.github.io/blob/meshlet-wip/content/posts/2024_06_08_virtual_geometry_bevy_0_14/index.md
I'd very much appreciate if you could give it a review! Criticism very much welcome. It's hard to know what other people will intuitively understand or what needs more explanation when writing in general, and this is an extremely technical topic that I tried to avoid spending 10 years writing 😅 .
I'll give it a read later tonite
that's pretty incredible, thanks for writing that, that's immensely valuable to learn how you implemented virtual geometry and some of these concepts, I hadn't heard of a depth pyramid before or virtual geometry
First off, congrats on writing about the entire pipeline!
Overall I think the article is very interesting.
I was on mobile so I could not check the code easily but here is my feedback:
-
"In the example with the wall with the rocks and trees behind it, we could see that last frame the wall clusters contributed pixels to the final image, but none of the rock or tree clusters did. “
Maybe I missed something but I could not see this example? -
I like the discussion of why a single indirect draw was chosen
-
Writing different vertex indices for the material pass is smart
something which could/should be presented at GPC in november 😉
@dull oyster + @wicked notch + @delicate rain https://www.graphicsprogrammingconference.nl/?v=4 dunno if @primal shadow is from europe too hehe
November? Not sure I'll be very available 😬
calling for papers ends at end of june (can probably be extended)
November 12-14, 2024
just saying
Maybe I missed something but I could not see this example?
It was two paragraphs ago, the first one in that section
"I mentioned earlier that frustum culling is not sufficent for complex scenes. With meshlets, we're going to have a lot of geometry in view at once. Rendering all of that is way too expensive, and unnecessary. It's a complete waste to spend time rendering a bunch of detailed rocks and trees, only to draw a wall in front of it later on (overdraw)."
Glad you liked it!
I'm American 😅 . But also not sure I've done much unique, it's msotly copied from Nanite.
Now that's the trailer's out I can answer: I will be too busy shipping Flight Sim 2024 to participate or event attend GPC 😢
i can send a mail to your manager if you want 😛
It's published now: https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14
keep learning new things on reread, frame budgets!
also using previous frame data to do occlusion culling is something I've been reading about elsewhere, it's super interesting, and I have no idea how to do it yet
but typically it would be that you produce visibility of objects for the next frame
so 1 bit per object that you write to this frame and read from the next frame
Read about 2 pass occlusion culling.
This is the article I recommend https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501
The only thing is it suggests using int mipLevel = floor(log2(max(AABB.pixelWidth, AABB.pixelHeight)));
I had bugs with that. I instead used max(0, ceil(log2(max_width_or_height))
You also don't have to store an explicit 0/1 visible bit between frames. I think it's easier to use last frame's depth pyramid.
thank you
np
you could add +1 to the floor one
Who’s thread is this?
lvstri's
it looks like he posts 4 times a month lol
This is the lvstri fan fan club.
We all gather in the hopes of LVSTRI showing up
Still didn't get my gpu signed by him
lustri is also neck deep in uni shit
I am officially done today
passed the last exam two hours ago 😄
Time to get back to the VSM struggle 
woohoo
Congratulations!
ah sorry, done with this semester
: )
so they say

he is
welcome to nanite HQ
Piece by piece we reassemble unreal engine 5
And once we manage it, ue6 will release, featuring a fully ai driven rendering pipeline
And we switch to being embedded developers
So it was pointed out to me that Tencent made a Nanite-like system on mobile
They mention they used 32 bit atomic buffers for software rendering
But this has got to be one of the least informative GDC talks I've ever seen. They explain nothing
So frustrating. I really want to know how they made 32 bit atomics work
i was under the impression that nanite used atomics as well
is there another way for software rast?
perhaps, it depends on how you want to tune your virtual geometry
if you don't need pixel sized triangles then you could entirely skip the soft rast
otherwise atomics are kinda necessary because you need to min on depth
is there some trick other than min/max and output - i cant imagine it is too complex
hmm the objective in the end is to write depth in some way
how would one emulate the depth test without min?
per-pixel linked list of fragment outputs, then sorting
Someone came up with one in the bevy discord.
- Atomic read cluster+triangle ID (aka visbuffer ID) from 32bit buffer
- Atomic min the depth in a second 32bit buffer
- AtomicCompareAndExchange writing the new visbuffer ID, and comparing using the visbuffer ID from step 1
- Retry if exchange fails
It was something like that, I forget the exact details
@primal shadow do you also stream and store partial lods in this impl?
so basically a cas loop 
Yep
Nope. Streaming is long term. Mostly because there's other more important stuff, and there's a lot I haven't figured out yet:
- Do I store vertex data per meshlet? Rn I store it per mesh and each meshlet references it
- Do I need to wait for transfer queues? Is using one main queue and blocking rendering work while mesh data gets transferred ok?


