#Iris - A Journey through OpenGL and beyond to learn Graphics

1 messages · Page 20 of 1

glass sphinx
#

i mean just looking at the capture it seems the wrote for indices is 4x the drawtime

primal shadow
#

Anyone know where I can upload a 35mb trace?

frank sail
#

Discord

wicked notch
#

should be fine here no?

#

I think the limit is 100MiB or something

primal shadow
#

it's 50mb

frank sail
#

It depends on the server

primal shadow
#

But also, can you explain the workgroup barrier omission first?

glass sphinx
#

i think my draw was around 20% slower but most of my meshlets were relatively full

#

but the compute part to write was way faster

frank sail
glass sphinx
primal shadow
#

They can't, I'd be growing the count too much. It's important that it's only incremented once.

#

I do have acess to subgroups though btw, as of this week

wicked notch
glass sphinx
#

instead of vertex count, each thread adds one

primal shadow
#

oh

wicked notch
#

btw is it normal that your vkCmdDispatch is 232x232x232? bleakekw

glass sphinx
#

kekek

primal shadow
#

since it's 1 wg per cluster

wicked notch
#

huh weren't you doing the two pass kinda thing?

#

which is muuuch better since you avoid that early out

#

you also avoid spawning ludicrous amounts of compute invocations bleakekw

primal shadow
wicked notch
#

wait so you have 232x232x232 surviving meshlets?

#

wot

primal shadow
#

No, It's not an indirect draw

wicked notch
#

ah

#

then you're definitely wasting time innit

primal shadow
#

It's on my list to write a compact buffer of surviving cluster IDs and make it an indirect dispatch, though

#

Or just combine the two passes if I can

#

maybe do some sort of subgroup/workgroup prefix sum to collacese the writes in the culling shader

#

sooo many possible options, tedious to test every one

primal shadow
# glass sphinx instead of vertex count, each thread adds one
// Reserve space in the buffer for this meshlet's triangles
draw_index_buffer_start_workgroup = atomicAdd(&draw_indirect_args.vertex_count, 1u);
draw_index_buffer_start_workgroup /= 3u; // ????

// Each thread writes one triangle of the meshlet to the buffer slice reserved for the meshlet
let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
let meshlet = meshlets[meshlet_id];
if triangle_id < meshlet.triangle_count {
    draw_index_buffer[draw_index_buffer_start_workgroup + triangle_id] = (cluster_id << 8u) | triangle_id;
}

This can't be sane, right? So atomicAdd() is going to return the same value for every thread??

#

uhh, the value before all the adds in the workgroup, or after?

glass sphinx
#

wait

#

i might be retarded (i definetly am)

#

i need to go on pc one sec

wicked notch
glass sphinx
#

you just use the offset every thread gets locally

glass sphinx
#

the atomic add retuens a different value for each one

wicked notch
#
const bool is_valid_primitive = gl_LocalInvocationID.x < meshlet.primitive_count;
const uint local_offset = subgroupExclusiveAdd(uint(is_valid_primitive));
uint global_offset = 0;
if (subgroupElect()) {
  global_offset = atomicAdd(..., (local_offset + 1) * 3);
}
buffer[global_offset + local_offset] = ...;```
#

I think this is correct?

glass sphinx
#

you can remove the local offset and the subgroup elect as well

primal shadow
#

oh, huh. Don't actually enforce consecutive locations in the buffer are for the same cluster, ok

glass sphinx
#

i mean it will be

#

cause the hw will do it with subgroup ops

primal shadow
#

right

glass sphinx
#

but yea you can do it like lvstri showed for piece of mind

#

lvstri your forgot a subgeoup broadcast first

#

🫨

wicked notch
#

ah you're right

#

it's joever

#

I'm not well versed in subgroup magique

glass sphinx
#

subgeoup majonaise is my favourite dressing

primal shadow
#
@compute
@workgroup_size(64, 1, 1) // 64 threads per workgroup, 1 workgroup per cluster, 1 thread per triangle
fn write_index_buffer(@builtin(workgroup_id) workgroup_id: vec3<u32>, @builtin(num_workgroups) num_workgroups: vec3<u32>, @builtin(local_invocation_index) triangle_id: u32) {
    // Calculate the cluster ID for this workgroup
    let cluster_id = dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
    if cluster_id >= arrayLength(&meshlet_thread_meshlet_ids) { return; }

    // If the cluster was culled or already drawn in the first pass, then we don't need to draw it
    if !meshlet_should_draw(cluster_id) { return; }

    // Exit this thread if the triangle slot is empty in the meshlet
    let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
    let meshlet = meshlets[meshlet_id];
    if triangle_id >= meshlet.triangle_count { return; }

    // Write cluster ID + triangle ID
    let i = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u) / 3u;
    draw_index_buffer[i] = (cluster_id << 8u) | triangle_id;
}

Yeah?

#

What's the subgroup stuff for in yours?

#

uh oh, something is broken

#

oh whoops

#

Should be let i = atomicAdd(&draw_indirect_args.vertex_count, 3u) / 3u;

glass sphinx
#

you should add 1

#

not tri count

#

or 3

#

yes 3

#

and then div by three

primal shadow
#

Ok I mean it renders fine, but it appears to be exactly the same perf...

glass sphinx
#

when you add 3?

primal shadow
#

Code is a bit simpler though so that's cool atl

glass sphinx
#

yea then its something else slowing it down

primal shadow
#

Why am I only at 20% occupancy :/

wicked notch
#

send spirv I guess KEKW

#

maybe wgsl compiler is generating absolute retard code

#

but I doubt

glass sphinx
#

i am downloading nsight nr

#

rn

#

i only have my laptop as im on a worktrip

#

also l´slow hotel internet

primal shadow
#

NSight also says this I noticed

#

Is that just because I have perfect occlusion culling lol

#

And it think it means culling is doing nothing?

glass sphinx
#

no it says its disabled

#

thats really strange

#

do you use discard_

primal shadow
#

What does it mean?

glass sphinx
#

?

primal shadow
glass sphinx
#

show frag shader

glass sphinx
#

nsight is expecting a 3.3x speedup with early z enabled damn

primal shadow
#

It's for the shadow view only, the main view dosen't have this issue

glass sphinx
#

ah you write frag deprth

primal shadow
#

Right, it's rasterizing depth

#

Eventually when wgpu gets image atomics, I'm just going to use that

glass sphinx
#

im confused

#

why do you need to write it like that

primal shadow
#

Yeah this bit:

#ifdef DEPTH_CLAMP_ORTHO
@fragment
fn fragment(vertex_output: VertexOutput) -> @builtin(frag_depth) f32 {
    return vertex_output.unclamped_clip_depth;
}
#endif
wicked notch
#

you could just enable depth clamp in your pipeline state

glass sphinx
#

your frag shader could be just empty no_

#

?

primal shadow
#

This is from somewhere else in bevy, let me figure out why that has to be a thing

glass sphinx
#

or write a color attachment if its special

#

no its not

#

this will write the depth buffer

#

it will disable early z

#

as you could change the value to be smaller then it is when early z runs invalidating it

#

bleakekw bevy going hard i see i see

#

i think you can just make the frag shader empty

#

return nothing

primal shadow
#

Ok I can't even find a PR where DEPTH_CLAMP_ORTHO was added

#

wtf is it doing?

wicked notch
#

depth clamping is necesary for some shadow mapping techniques

#

but it's an option in your rasterizer state that you can just enable, there's no reason to do it yourself

primal shadow
#

I feeeelllll like we tried that, hrm

#

let me try it though

glass sphinx
#

btw what gpu do you have_

#

?

#

i dont get how the occupancy is so low

primal shadow
#

Rtx 3080

primal shadow
buoyant summit
#

vkCmdSetDepthClampEnableEXT

primal shadow
#

I don't think wgpu exposes it xD

wicked notch
#

bruh

primal shadow
wicked notch
#

When you set DepthClipEnable to FALSE, the hardware skips the z clipping (that is, the last step in the preceding algorithm). However, the hardware still performs the "0 < w" clipping. When z clipping is disabled, improper depth ordering at the pixel level might result. However, when z clipping is disabled, stencil shadow implementations are simplified. In other words, you can avoid complex special-case handling for geometry that goes beyond the back clipping plane

#

how does anyone unironically use D3D12

glass sphinx
#

you will loose a ton of perf if you can not do early z

#

but you can force it on in most apis

primal shadow
#

I'll add it to my list of "things that will be faster when X wgpu issue is closed" :/

glass sphinx
#

LGPU

buoyant summit
wicked notch
#

I think MS is meming

buoyant summit
#

I don't see vk spec talk about w, it seems to be operating after / w

#

so it's talking about clamping z/w

wicked notch
#

yes

#

as it should be

#

idk what the hell D3D12 wants to accomplish here

buoyant summit
#

what this tells me is that we don't have enough CTS in this area

#

in vk

#

or I'm misunderstanding something

primal shadow
#

This shouldn't matter once I add software raster anyways, as I'll be directly blitting the depth in the fragment shader

#

Ok, so, raster is not going to get any faster anytime soon until I get software raster or mesh shaders

#

Back to writing out an index buffer

wicked notch
#

did you profile your shader yet

primal shadow
#

Oh no, let me go do that

#

I got distracted by the zcull lmao

primal shadow
buoyant summit
#

idk what it sounds like

glass sphinx
#

jasmin could you try increasing the wg size for me in the write index buffer shader?

#

btw you hsoul default to 128 or 256 size wgs

buoyant summit
wicked notch
primal shadow
#

Idk I'm lost by it lol

glass sphinx
#

the 3080 cant reach peak occupancy for workgroups smaller then 96 threads but with 128 it can reach max launch speeds for warps as well

glass sphinx
#

amd also preferes the same numbers per default

#

64 is good for time - diverget work

primal shadow
#

ok, I'll switch culling to 128, that's easy

glass sphinx
#

but for things like this that finishes fast its better to have larger groups

primal shadow
#

so maybe I'll keep it 64 lol

glass sphinx
#

how is cullign divergent

#

its not really

primal shadow
#

1 thread per cluster

#

some early exit on frustum culling or w/e

glass sphinx
#

with divergent i mean massive differences in runtime between threads like 10x

primal shadow
#

oh ok

glass sphinx
#

for example compute raytracing for ssr or ssao

#

thats mega divergent

#

culling should be pretty cozy

primal shadow
#

Ah, thank you. Yay for learning.

wicked notch
#

you retire the entire workgroup at once with culling

#

zero divergence frogapprove

primal shadow
#

Ok so, profile write_index_buffer, then try increasing the thread count

glass sphinx
#

somehow the compute job acts as tho you have workgroups the size of 32 or so

#

in nsight it looks like that at least

primal shadow
wicked notch
#

I mean the write index buffer

#

my bad

#

culling speed is not an issue is it

#

even if you diverge like that it's pretty minor

primal shadow
#

yeah cullign speed is like 0.3ms on this insane cluster count

#

it's nothing

#

raster + write index buffer + per frame data uploads are the issue (ignoring shading, currently broken, and inefficent anyways :P)

glass sphinx
#

this is why we love mesh shaders

#

and soon work graphs with mesh nodes

#

warp drive

primal shadow
#

yeahhhh

#

not going to get that for a looong time

wicked notch
glass sphinx
#

my fingers are itching

#

jasmine can you run with 128 wg size

primal shadow
#

It takes a good 5m per trace because there's so much data lol, give me a few mins 😅

#

I'll have to switch to 1 frame per trace, even 2 takes forever

glass sphinx
#

oki

primal shadow
#

ALU bound??

glass sphinx
#

btw @primal shadow its ub to return from a shader before entering barriers

#

all threads must enter the same barriers

primal shadow
#

whuh

glass sphinx
#

or does wgpu do some magic?

primal shadow
#

I assume wgpu is protecting me here

#

Cause I don't get any errors from them

#

Actually no, meshlets don't use barriers anymore

glass sphinx
#

yea it will work

primal shadow
#

Except for downsample depth hiz

glass sphinx
#

most of the time

primal shadow
#

when I just copied from SPD

glass sphinx
#

but then you wll die randomly

wicked notch
primal shadow
#

Idk how to export the shader, so let me upload the trace again ig

glass sphinx
#

yea that would be sus as then shading langs should do that also

glass sphinx
#

how did it nt change

#

very strange

primal shadow
#

this is with 64 wg still

buoyant summit
glass sphinx
buoyant summit
#

webgpu also leaves that to be ub

left jacinth
#

good

glass sphinx
#

no safe in gpu land

wicked notch
#

safety memes don't work in gpu land

glass sphinx
#

i connected to gabes neuron to help

wicked notch
#

checkmate

primal shadow
#

I wish webgpu was more like wgpu_hal sometimes :/

#

I am 4mb over the limit even when compressed ahhh

buoyant summit
#

nvm I was overthinking it

primal shadow
#

anyways this is the only expensive part of the shader

buoyant summit
#

but ye webgpu does not protect you from ub

#

you will get ub

#

it's extremely infectious

primal shadow
#

and a little here

glass sphinx
#

ok i have a suspicion

buoyant summit
#

reminder that lgsb means waiting for global memory op

glass sphinx
#

does it run with a workgroup size of 32?

#

and loops?

#

can zou see that in the spirv?

primal shadow
glass sphinx
left jacinth
#

long scoreboard (waiting on a read) doesn't have anything to do with why the workgroup is size 32

buoyant summit
#

it can be texture samples, loads, stores, atomics, whatever, but operating on global memory (or scratch/private/stack/whatever too)

left jacinth
#

(presumably is size 32 with a loop, because there's no other explanation)

glass sphinx
#

jasmine in the spv at the top can you give us the workgroup sizes?

primal shadow
buoyant summit
#

11, OpExecutionMode %write_index_buffer LocalSize 64 1 1 ; 0x00000098,,,,,0,,,

#

64x1x1

glass sphinx
#

yea

#

hmmm

#

i wonder if 128 size changes anything

primal shadow
#

let me go try

#

2 clusters per wg now, give me a few minutes to rewrite things

glass sphinx
#

ok

#

i suspect its launch bound, so the gp cant launch new workgroups fast enough and the warps retire so fast that the gpu cant go full saturation

#

so making the wg bigger should help a lot then

primal shadow
#

nsight won't tell me that?

glass sphinx
#

at least to increase occupancy

#

i think it did say that at some point but they removedthe metric

#

they changed it a lot last year very annoying

primal shadow
#

Changing it to an indirect dispatch over surviving meshlets only should help I believe

#

or ideally just combine the pass into the culling pass

primal shadow
#
Write index buffer is wayyyy too expensive. Ideas/options:
* Increase WG thread size
* Write out a buffer of surviving cluster IDs, using an indirect dispatch
* Combine culling and writing index buffer, somehow
* Use multi draw indirect, 1 IndirectDrawArgs per cluster
* Use a fixed size 64 * surviving_cluster_count draw, write out a buffer of surviving cluster IDs, and then have the vertex shader write NaN for excess invocations
left jacinth
#

launch bound makes sense in that case I believe. I'd try looping to handle multiple things in 1 compute thread

#

but use 128 wg size either way

#

horrid ass 30 series cards

glass sphinx
#

lol

#

yea 128-256 is a good default for things like this

primal shadow
glass sphinx
#

all other nv and amd gpus

primal shadow
glass sphinx
#

have a ratio of 2:1 for warp to workgropup slots on each sm

#

only the 30 series has a ratio of 3:1

#

they did that so that they can have more registers for raytracing for each thread but it turned out to bea shit tradeoff

left jacinth
wicked notch
left jacinth
#

I daily drive a 3070 🤡 🔫

glass sphinx
#

to be clear tho, having larger wg sizes for these workloads is better on all gpus

wicked notch
#

me when I have a 3070 too

glass sphinx
#

its jsut that the 30 series can never ever reach full occcupancy at all with 64

primal shadow
# left jacinth both

Write surviving cluster buffer + indirect dispatch to get rid of "spawn 1 wg per cluster, just exist immediatly if culled" is also an option

#

Hard to say 😅

glass sphinx
#

i will check your shaders tomorrow

wicked notch
#

I use 256

left jacinth
#

just cope

glass sphinx
#

nice

buoyant summit
#

this bad boy has 2 MMUs

#

does your gpu have 2 MMUs?

#

that's right, it only has one

wicked notch
#

nano can unlock the true potential of VSM

#

it's called vkQueueBindSparse

buoyant summit
#

honestly idk how the fuck queue bind performance still sucks after 13 years first gpus shipped with this feature

#

on windows idk what's the problem but elsewhere certain drivers oversync to hell

glass sphinx
#

its very sad

primal shadow
#

hazah, write_index_buffer is now ~5.4ms using wg=128, 2x as fast!

#

....still very slow 😅

#

256 time now

glass sphinx
#

niiice

#

in that case it really is bound by launch speed and the used warp slots

primal shadow
#

I did the math here correct, right?

var cluster_id = 4u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
let triangle_id = local_invocation_index % 64u;
cluster_id += local_invocation_index / 64u;
#

And is the compiler smart enough to combine the % and /?

#

4 clusters per workgroup

#

64 triangles per cluster

wicked notch
#

you could easily just do >> 6 and & 0x3f if you don't trust your compiler

glass sphinx
#

yes the compiler will find that

left jacinth
#

the compiler will ABSOLUTELY do that for you

#

so long as you're dividing by a compile-time constant

glass sphinx
#

if its not you die by stroke

left jacinth
#

if it doesn't do it for you, abandon wgsl. Don't care.

glass sphinx
#

(the divisor is not constant, you will be asleeped by the compiler forever)

glass sphinx
buoyant summit
#

the other day I heard about a d3d11 game doing 1D dispatches and then in the shader it was translating that to 2D coordinate by doing stuff like % thingy.GetDimensions().x etc

glass sphinx
#

then the math will be simpler

primal shadow
#

oh, true lol

buoyant summit
#

they are literally doing int div by a non-constant divisor

glass sphinx
#

they are amazing i will hug them

wicked notch
#

potrick is MS spy

buoyant summit
#

hug of death

primal shadow
#

Slightly faster with 256 WGs, 4.3ms now

glass sphinx
#

niiice

#

no more launch bound

primal shadow
#

What am I bound by now? Hard to tell

#

it still shows LGSB on bitfield extract

wicked notch
primal shadow
#

it's also still the same speed in both passes

#

so yeah, wasted invocations...

glass sphinx
#

wated invocations are ok

#

as long as its a good tradeoff compared to other options

primal shadow
#

Memory latency, where? The atomic write? the looking up meshlet data?

primal shadow
# glass sphinx as long as its a good tradeoff compared to other options
  • Multi draw indirect is too slow, bottlenecks at command processing, single draw indirect is neccesary
  • Using fixed-size increments of 64 vertices and outputting NaN for excess triangles slows the draws down by a decent amount, but it's an option
  • Somehow combining culling and write index buffer shaders are an option, but difficult to get working well, as each thread is writing a different amount of vertices
glass sphinx
#

atomic and writing yea

primal shadow
left jacinth
#

web is useless

#

don't care

primal shadow
#

I think my best bet is either combine the shaders if I can, or make write_index_buffer an indirect dispatch

left jacinth
#

garbage nonsense limitations don't allow your shaders to be limited just because Apple exists

#

or cope that that's why it's so slow

primal shadow
left jacinth
#

yeah I'm just raging

buoyant summit
#

afaiu you can just throw wgpu out of bevy?

#

just not use it

#

you mean you're working on the engine

primal shadow
left jacinth
#

I don't get how people aren't more upset about the limitations on web

#

like you are trying to make a nanite-like renderer

#

you absolutely need to utilize the hardware as much as possible for that to even be viable

#

mesh shader extensions were added to Vulkan SIX years ago

#

that's 2018. Vulkan was initially released in 2016... Mesh shaders are almost as old as Vulkan is 💀

#

but yeah web has no support 😐

#

patrick no

glass sphinx
#

🤓 👆 the cross platform extension came 2022

left jacinth
#

*for vulkan, yeah

glass sphinx
#

web is always needs to support all the bad old hw

buoyant summit
#

wgpu isn't restricted to web, they have some exts in there not available on "the web"

#

but yeah it seems like it's another layer of bikeshed

glass sphinx
#

yes

primal shadow
#

Fixed large bug, down to this now

glass sphinx
#

nice

#

still quite heavy

primal shadow
#

mhmm, write_index_buffer is still chunky due to all the extra dead workgroups

left jacinth
#

Very well done, good job!

primal shadow
#

and the 8ms 2x copy at the start is due to lack of ReBAR support

glass sphinx
#

wait

#

is the write index buffer not indirect_

#

oooh i see

primal shadow
#

nope

glass sphinx
#

yea hahaha

#

ok i think that will make it much much faster

primal shadow
#

Right, but now the culling pass needs to output a buffer of surviving clusters 🙂

#

and aotmic increment an indirect dispatch

#

I'm going to be sneaky and reuse the index/triangle/primitive/thing buffer for the surviving cluster buffer though

#

Maybe, idk

glass sphinx
#

👍

primal shadow
#

Is there a way I can avoid the large data upload at the start of the frame? For each cluster in the scene (there might be millions), upload 2 u32s (meshlet ID, and instance ID). Is there a way I can make this persistent, maybe? Idk.

#

Instance ID tells the cluster what instance/entity transform to read

#

Meshlet ID tells the cluster what meshlet asset it's an instance of

primal shadow
#

No matter what I do, I'm going to need more buffers 😭

#

Buffers for telling software raster what cluster to use, and same for hardware now

#

I mean 4 bytes per cluster is not much, but dynamically managing the buffer size is sucky

wispy spear
glass sphinx
#

its way too big considering pcie speeds

#

you should have a list of meshlets per mesh, a list of meshes per entity and a list of entities.
Then expand from entity, to meshlist to meshes to meshlets

#

uploading 8 bytes per meshlet a frame will never scale its way too much data

glass sphinx
#

didnt you have that already and nanite like lod selection? maybe im confusing projects

#

wait we are in lvstris channel the whole time bleakekw

left jacinth
#

lmao

#

is that deccer from DeccerCubes

glass sphinx
#

yes

#

lmao

left jacinth
#

they named a user after DeccerCubes

glass sphinx
#

they named a cartoon after a shadow artefact

primal shadow
glass sphinx
primal shadow
glass sphinx
#

for you its probably not too much

primal shadow
glass sphinx
#

there are multiple ways to do this

#

the simplest is prefix sum + binary search

#

in this gist i link example code for binary search + prefix sum

primal shadow
primal shadow
buoyant summit
glass sphinx
#

its much better then uploading all from cou bleakekw

buoyant summit
#

wgpu sync moment

glass sphinx
#

you can go hard and do the power of two argument buffers

#

thats very fast

primal shadow
glass sphinx
#

wgpu also spamms hash maps

buoyant summit
#

the correct answer is not tracking

glass sphinx
#

and lazyliy caches so much

#

its really bad imo

#

dont like

buoyant summit
#

ye

primal shadow
#

The power of 2 argument buffers

glass sphinx
#

one sec

#

the idea is

#

for example

#

meshlet cull shader feeds meshlet draw shader

#

i will answer soon

primal shadow
#

No rush, thanks

glass sphinx
#

sorry other way

#

so mesh cull shader, each thread does one mesh

#

expands to meshlet cull shader

#

each thread does a meshlet

#

the eypansion between them is asymmetrical

#

as meshes have differing amounts of meshlets

#

the idea now is

#

to use a property of binary numbers

#

in the mesh culk you take the bumber of meshlets, and write one of 32 buffers, for each bit set in the meshlet count number

#

this way, instead of writing n values where n is meshlet count

#

you write up to 32 times, one time per set bit

#

then these 32 buffers contain a aegument each

#

deoending on their "power" each elemnt desceibes 2^n work items

#

so then

#

you do 32 dispatch indirect

#

for each arg buffer

#

for the meshlet cull shader

#

and each thread can simply bitshift their id by the power

#

to get the argument

#

its a little involved

#

hard to explain fir me

primal shadow
#

Uhh so let's say you have 10 meshlets

#

And it gets culled down to 4

#

Can you give a practical example of what it would look like?

glass sphinx
#

yea

#

id you have a mesh with 12 meshlets. It will first cull the mesh if it survivies it will write arguments to cull 12 mehslets

#

thats 1100 binary

#

so it will write the 3rd and 4th buffer

primal shadow
#

what if you have 2 meshes with 12 meshlets?

glass sphinx
#

then

primal shadow
#

Do you need 32 buffers per mesh?

glass sphinx
#

no

#

you append to them

primal shadow
#

append how?

glass sphinx
#

um

#

i ll make an example when im homr

#

sorry i cant rn did an oopsie in planning

primal shadow
#

nah nw, I'll keep thinking on it. Thanks!

glass sphinx
#

Ok

#

so an example would be

#
Surviving Meshes meshlet counts: 
mesh0:   6 meshlets              
mesh1:   3 meshlets              
mesh2:   1 meshlet               
mesh3:   5 meshlets         
mesh4:   7 meshlets     
#
Each bit in the meshlet count maps to an arg buffer:

Surviving Meshes meshlet counts: | arg buffers appended:
mesh0:   6 meshlets              | 00000110
mesh1:   3 meshlets              | 00000011
mesh2:   1 meshlet               | 00000001
mesh3:   5 meshlets              | 00000101
mesh4:   7 meshlets              | 00000111
#

in this example we have 8 arg buffers

#

so the arg buffers will contain

#
arg buffer 0:
[mesh: mesh1, meshlet offset: 0] [mesh2, 0] [mesh3, 0]
arg buffer 1:
[mesh0, 0] [mesh1, offset: 1]
arg buffer 2:
[mesh0, 2] [mesh3, 1]
#

each argument is a pair of mesh index and meshlet offset into the mesh

#

so when for example looking at mesh0, you will append to arg buffer 1 and arg buffer 2

#

into 1 it will append mesh0, offset 0 and into arg buffer 2 it will append mesh0, offset 1

loud crag
#

you using metal or why are you calling these argument buffers

glass sphinx
#

arguments is the naming convention for some renderers for buffers that determine what threads in indirect dispatches map to

primal shadow
#

So max of 8 meshlets per mesh, so 1 + ceil(log2(8)) = 4 argument buffers?

glass sphinx
#

yes

#

in this example with 8 you could do 2^8-1 meshlets per mesh

primal shadow
#

Then after culling, if there are N meshlets left for a mesh, append this mesh to the argument buffer X for all set bits X in N?

#

And ok, you append the mesh ID to each of those argument buffers

#

But what are the meshlet offsets?

glass sphinx
#

these arg buffers are then read by meshlet culling

#

all meshlets of a mesh go into arg buffers

primal shadow
#

Right, I don't actually do mesh culling, I want to use this for meshlets culling -> triangles, but same concept so anyways

glass sphinx
#

it tells the shader at what point to start working on the mehslet list

#

so for the example of mesh3:
first arg is in arg buffer 1: [mesh3, 0].
second arg is in arg buffer 2: [mesh3, 2]
in the meshlet cull, the threads read their dispatches power, so arg buffer 1 has power 2, arg buffer 2 has power 4.
they div their thread count by power and mod it by power.
the dived value is the arg index, the moded value is the offset inside the arg.

#

the first arg isin buffer 1

#

starting at 0

primal shadow
#

What does starting at 0 mean though

#

I'm lost on what the offsets are for

#

Also, won't this ruin cache efficiency, if you're spreading the meshlets for a single mesh out across workgroups?

glass sphinx
#

two threads will cull in arg buffer 1 dispatch

#

then in the arg buffer 2 dispatch, 4 threads will work on the rest of the meshlets

#

but they need to start at meshlet index 2

#

as the first arg buffer dispath alreadz worked on the first two meshlets

#

technically

#

you can also get that value by masking the mehslet count lower then the powers index

#

i will improve the example

primal shadow
#

Please do, I am lost 😅

glass sphinx
#

yea its hard to understand

#

a lot of bit twisting and indirections

#

i thing better examples

#

will make it clear

glass sphinx
#

I dont have mroe time today

#

i will expand this next week

delicate rain
#

"I don't have time" says bro while posting a detailed elaborate graph with colors and explanations

glass sphinx
primal shadow
#

ooooh ok this makes sense now!
Each item in an argbuffer represents 2^i units of work
hence the mesh_id of where it came from, and the meshlet_offset

So given the 13 meshlets in your example
it would be:
argbuf0: offset=0..=0
argbuf2: 1..=4
argbuf3: 5..=12
except you only need to store 0, 1, 5

ok two questions then:

  • How is this cache efficient, if you split the meshlets for a single mesh between many argbuffers?
  • Isin't this very lopsided work-wise? Each argbuffer has exponentially more args to process. Are you supposed to then take that mapping, and redistribute it somehow across the threads? Or is it fine
    Also I need to go figure out how many threads you spawn to process each argbuffer, and how it'll work with multiple meshes
delicate rain
#

Regarding the second point, I believe you'd just launch the appropriate number of threads for each buffer, so that each thread gets one

#

First question I don't understand, I doubt cache comes into play on a meshlet granularity

primal shadow
#

meshlets need to lookup their mesh's transform and such

#

I also want to try and use this for meshlet->triangle writing

glass sphinx
#

it will also make the atomics much slower

#

as you need to do more of them

#

but for large numbers the bandwidth and divergence benefits win

#

i didnt test much but meshlet count over like 8-32 were better already with the po2 buffers

#

any if you have larger meshes its muuuhc better

#

I dont get point 2

#

how would it be lopsided

primal shadow
#

Ok no I realized something

#

it's not 1 thread per arg

delicate rain
#

The idea is to launch more threads for bigger arg buffers

primal shadow
#

It's 1 thread per unit of work, of which there is 2^i units of work per arg

glass sphinx
#

you launch (argcnt * 2^(arg buff index)) threads per arg buffer

#

so you waste nearly 0 threads

#

the po2 solution is by far the fastest i could find so far for this workload btw

#

what could be much better are workgraphs

#

they solve this on an api level

delicate rain
#

Is cache locality that big of a deal on GPUs anyways btw? You have way more latency hiding compared to CPU, I don't really hear much about it

glass sphinx
#

its extreamly important

#

much much more then on cpus

delicate rain
#

Really, hmmm

glass sphinx
#

and its getting worse each gen

#

as compute gets faste rmuch faster then cache

delicate rain
#

Interesting yeh, I've not really been paying much attention to cache when writing shaders

#

Other than shared mem I guess

glass sphinx
#

its ver intrecate

delicate rain
#

There is so much to think about

glass sphinx
#

it can be better to shit on cache locality if it means you safe a ton of mem bandidth like in this case

#

but you can do a lot of warp level optimizations btw @primal shadow that make the cache locality pretty good again fo the writes

primal shadow
glass sphinx
#

yes

primal shadow
primal shadow
delicate rain
#

That reminds me btw @wicked notch if you make your pages 128 your mark visible pages pass (if you scalarize) will be > double the speed

primal shadow
#

especially because I have no way of overlapping them...

glass sphinx
#

64 dispatches are nothing

#

they will auto overlap

primal shadow
#

no wgpu will insert barriers 😭

glass sphinx
#

i really hope wgpu is not that crippled

wicked notch
glass sphinx
delicate rain
#

How fast?

primal shadow
#

each dispatch is treated as a "usage scope", and will ocmplete in order via barriers between them :/

delicate rain
#

Was like 1.2 ms without scalarization for me

wicked notch
delicate rain
#

Wtf

glass sphinx
#

not 64

delicate rain
#

How

wicked notch
#

I really do nothing special there

glass sphinx
#

show code

delicate rain
#

Do you do atomics or?

primal shadow
wicked notch
#

I do no atomics

glass sphinx
delicate rain
#

Ah no caching

#

Figures

glass sphinx
#

the arg buffers double the work each time

delicate rain
#

Nevermind then

primal shadow
#

log2(64) = 6*

wicked notch
#

ah yeah with caching it would be a problem bleakekw

wicked notch
glass sphinx
primal shadow
#

shhh

glass sphinx
#

keko

frank sail
#

6

primal shadow
#

ok maybe 6 is fine barrier wise then

delicate rain
glass sphinx
#

atomics scalarization makes it warp speed again

delicate rain
#

Young man does image store with no atomics with multiple threads writing the same spot

#

Is that legal?

wicked notch
#

it's ub that just works

#

one could do imageAtomicStore

delicate rain
#

And watch your perf plummet

wicked notch
#

yes kekkedsadge

delicate rain
#

Btw how do you allocate? You do it after?

wicked notch
#

ye

delicate rain
#

Loop through VSM and find all visible this frame and allocate

glass sphinx
wicked notch
#

ok I will fix

glass sphinx
#

is it jsut marking it?

#

like or iing a 1 on?

delicate rain
#

They all write the same ye

wicked notch
#

yes

glass sphinx
#

then atomic store might actually just be the same speed

#

on nv hw

#

could be that that stores of that size are inherently atomic

#

relaxed atomic writes on x86 are also completely free

delicate rain
#

Sad that I need to or

#

But I don't want to make another imaž just for this bitmask

glass sphinx
#

yea

wicked notch
#

it's a 2KiB image it's fiine

delicate rain
#

Another fetch is what makes me sad

#

Not the memory itself

wicked notch
#

the whole image stays in L2 cache

#

it's fiiine

glass sphinx
#

the tap is shit but if its so small its probably ok

delicate rain
#

You make convincing point

#

Alas it's not even near to being the bottleneck

glass sphinx
#

also the scalarization would still help btw

delicate rain
#

As Jaker would say, it doesn't matter for now

glass sphinx
#

less contention still

#

real

#

its fine like so

#

tido crashes on amd shader compiler crashes

#

goofy aah drivers

delicate rain
#

I just noticed the pass spiking to 130us when I reduced the page size to 64

glass sphinx
#

why 64

delicate rain
#

Compared to 30us for 128 page size

#

More pages == less scalarization

delicate rain
glass sphinx
#

yes why do you have more pages now

#

is it important

#

ah

#

okk

delicate rain
#

You theoretically get better culling

glass sphinx
#

eh

delicate rain
#

And tighter allocation

wicked notch
#

I also tried 64 but for a different reason

#

I wanted to see whether smaller pages could help with projective aliasing

delicate rain
wicked notch
#

(they don't)

delicate rain
#

It doesn't matter what size the page is

#

For the resulting shadow quality

#

The information density is the same

wicked notch
#

ye I realized that later KEKW

delicate rain
#

You get tighter allocation and culling

wicked notch
#

btw

#

did you figure out how to reduce the VSM

#

because there's a resolution difference between the VSM itself and the virtual resolution

delicate rain
#

Was?

#

I don't follow

#

Reduce for hiz you mean?

wicked notch
#

ye

delicate rain
#

I mean you just make MIPS of the memory block and reduce each page there (the mapping is preserved 64 wide page become 32 wide in MIP 1) and once you reach max MIP (each page is 1 Texel) you use that to build the virtual HIZ on top (starting at page table resolution and going again Down to a single texel)

#

That is my current working plan for implementing that

wicked notch
#

hmm I see

delicate rain
#

but these are future plans

#

I am stuck on fixing tido bugs and dynamic object shadows

#

I wanted to generate the invalidated pages mask on the GPU but the geo pipeline we have in Tido is hellish (and I still don't really understand it) so I'll prob cope with CPU approach

primal shadow
#

So uhh, if I just remove the write_index_buffer pass entirely, and go back to doing it from the culling shader itself, its much faster...

if meshlet_visible {
    let meshlet = meshlets[meshlet_id];
    let start = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u) / 3u;
    let base = cluster_id << 8u;
    for (var triangle_id = 0u; triangle_id < 64u; triangle_id++) {
        if triangle_id >= meshlet.triangle_count { return; }
        draw_triangle_buffer[start + triangle_id] = base | triangle_id;
    }
}
glass sphinx
#

probably cause nearly all threads were dead in the write index buffer pass

#

for writing the indices i think it might be better to just loop btw

#

cause the numbers might be too low and similar to get any gains from po2 arg buffers

primal shadow
primal shadow
#

I've gotten it down to a very very reasonable amount of time 🙂 (minus the data uploads, stupid data uploads)

primal shadow
#

Ok, idea

#

We already have a list of instances

#

Run a compute shader to write out the per-cluster data directly

#

I could also slap on per instance culling in the same shader

glass sphinx
#

you should probably look at othe rpeojects for this as well

#

most of this has been done many times already

primal shadow
#

Occlusion culling was broken (accidently bound 1 mip, instead of every mip), fixed it. Still some overdraw, but that's always been the case. I need to tweak the occlusion culling some more.

primal shadow
#

Should save a lot of culling time, and also solve my data upload problem. I won't need to upload 8 byters per instance of a meshlet every frame. Instead, I can just upload 1 u32 per mesh instance pointing to the BVH root of that mesh's meshlet tree.

glass sphinx
#

you dont need persistent threads for that

#

you'll have to make a lockless queue and make a table of device to threadcount for it

#

very shaky

primal shadow
#

Well, the alternative is a bunch of dependent dispatches to traverse the tree...

primal shadow
primal shadow
#

So my test scene has 12_463_400 clusters

#

I upload 8 bytes per cluster each frame

#

PCIE x16 Gen3 bandwidth is 16 GB/s

#

12_463_400 * 8 ~= 99.7 MB

#

At 16 MB/ms, that's ~6.23ms of upload time

#

NSight shows 92% throughput, and overall ~8.4ms of upload time

#

So yeah, far too expensive

glass sphinx
#

yea thats horrible

primal shadow
#

I need to reduce this down to per-instance data upload

glass sphinx
#

you can also make it completely gpu driven

primal shadow
#

The power2 dispatch trick would work well, and then write out per-cluster data from a compute shader, but I would have to put a barrier between every compute dispatch because of wgpu (unless I hack around it?)

#

Alternatives include making the clusters form a tree of LODs, and writing out only the LOD tree ID per instance, and then making the shader traverse the tree instead of a flat list

#

Or uploading a per-instance prefix sum of meshlet counts per instance, and having all 3 shaders (cull, raster, shading from visbuffer) do a binary search to map thread ID -> instance ID :/

primal shadow
primal shadow
#

Oh wait, there's another option here

#

Upload the prefix sum yes, but then have a pass at the start perform the binary search and output flat buffers

glass sphinx
#

do zou do the lod traversal and nanite thing on cpu_

#

?

#

i mean you can try persistent threads

#

but i think that will be pain

primal shadow
primal shadow
#
  • On the CPU, compute a prefix sum buffer of meshlet counts per instance, and indices to the start of the meshlet buffer for that instance
  • On the GPU, run a new pass once at the start of the frame, 1 thread per instance of a meshlet (same as culling). Each thread (cluster) does a binary search on the prefix sum meshlet count list to find their instance ID, plus meshlet asset ID, and then writes it to the buffers for subsequent passes.
primal shadow
#
New buffers:
    meshlet_mesh_slices = [2 u32's per meshlet_mesh asset containing buffer slice]
    instance_meshlet_meshes = [map of instance -> meshlet_mesh_slices id]
    meshlet_count_prefix_sum = [scratch buffer of size instance_count]

Shader 1: <perform prefix sum of 1 thread per instance to fill out meshlet_count_prefix_sum>
Shader 2: <follow below of 1 thread per cluster to fill out thread_instance_ids and thread_meshlet_ids>

let cluster_id = global_invocation_id.x;
let instance_id = <binary search in meshlet_count_prefix_sums using cluster_id>;
let meshlet_mesh_id = instance_meshlet_meshes[instance_id];
let (meshlet_start, meshlet_end) = meshlet_mesh_id[meshlet_mesh_id];
let meshlet_id = meshlet_count_prefix_sums[instance_id - 1] - cluster_id; // If instance_id = 0, meshlet_id = cluster_id

meshlet_thread_instance_ids[cluster_id] = cluster_id;
meshlet_thread_meshlet_ids[cluster_id] = meshlet_id;
primal shadow
#

@wicked notch do your clusters (LOD 0 or otherwise) reference a single set of vertices, or do you have a new set of vertices per cluster?

wicked notch
#

per group but yes

#

idk if I want to keep that, but it's nice and shrimple

primal shadow
#

hrmm, I have a single set of vertices, and just reference them per cluster

#

everyone seems to do the opposite though

honest socket
#

having a single set of vertices is gonna eventually not be compatible with streaming when we add that

primal shadow
#

mhm

glass sphinx
#

ive never heard of that

#

every impl i saw is referencing a single set of verts

#

aside from those that quantize

primal shadow
#

Saved about 32ms of frame time in my nanite impl 😄

#

~8.4ms GPU, ~17.5-27.8ms CPU

#

No more uploading per-cluster data

#

170 FPS

wide shadow
#

This is like nanite outpost. Maybe this info will be useful.
Guys at epic games have been really busy making nanite better. Now you can have tessellation, displacement and skeletal animation with nanite.
They got rid of material buffer(abusing depth buffer with OP EQUAL) and now they shade in compute shader. They detect which derivatives(by looking at the dissembly of the vendor blob) are used and then select which pipeline is used by doing this they reduced amount of helper invocations by 6-10%. They now use VRS which 27-45% reduction in shading. Also something to make lpotrick little jealous is that now they use work graphs.
There is nothing(only git commits) on how they did tesselation, displacement a skeletal animations yet.

primal shadow
#

VRS/compute shading is on my list todo, but not very high priority atm

#

workgraphs (and mesh shaders) I don't have access to sadly

#

the other stuff there are papers on, although idk how nanite specifically implements it (I don't read their code)

#

I'm still a good bit off of nanite 1.0 levels of perf, although I'm making a lot of progress

#

There's also VSMs, which I haven't implemented at all

#

Per-view nanite is very expensive... and I want shadow views...

wispy spear
#

lustri, im really fond of the new auto thingy() -> shizzle {} way of conjuring c++

#

thanks for ze inzpiration and being a part of my life ❤️ 🙂

wispy spear
#

it do be

finite yacht
#

i am curious, why do you prefer it over shizzle thingy() {}

wispy spear
#

: )

finite yacht
#

great argument

wispy spear
#

that way all declarations are neatly aligned

#
virtual auto Initialize() -> bool;
virtual auto Load() -> bool;
virtual auto Unload() -> void;

virtual auto OnMouseMoved(double mouseX, double mouseY) -> void;
virtual auto OnMouseEntered() -> void;
virtual auto OnMouseLeft() -> void;
virtual auto OnFramebufferResized(int32_t width, int32_t height) -> void;
virtual auto OnFilesDropped(const std::vector<std::string>& fileNames) -> void;
finite yacht
#

thats nice

loud crag
#

and generally tbh I don't like trailing return types unless it's deductable

wispy spear
#

i dont know what that means, or rather how it would look like

#

leaving stuff out feels weird

loud crag
#

just as an example from my math thing im writing right now:

        constexpr auto operator*(vec<T, M> other) const noexcept {
            vec<T, M> ret(T(0));
            for (std::size_t i = 0; i < M; ++i)
                ret += col(i) * other[i];
            return ret;
        }
#

it just knows that the return type is vec<T, M>, and then just auto is enough

wispy spear
#

ah

loud crag
#

same happens with lambdas, generally

wispy spear
#

thats too next level for me still

#

although, in c# i do the same, when it comes to lambdas

#

and in c++ too i think, when it comes to lambdas

wispy spear
#

eh the other thing is prefixing types

#

EMyEnum, SMyStruct, im not cool with CMyClass yet 🙂 too deep wounds from the 90s, but its a logical thing to do

glass sphinx
#

its not something that can be added easily

glass sphinx
#

shouldnt it be totally gpu bound if is nanite like?

primal shadow
glass sphinx
#

ah thats the savings number

primal shadow
#

Yes

wheat haven
#

went to the retina repo for some inspiration and noticed you managed to get an all-numbers commit hash (at least the short part github shows)

severe dome
#

it becomes significantly more beneficial when you start doing things with templates

#

any kind of long name in a return type, or any kind of sfinae or type trait nonsense is completely unreadable when not using trailing return types

#
  • trailing return types for member functions have access to the namespace of thr class they're in, which is great if you're using aliases a lot
#

Foo::Bar<Foo::Baz> Foo::buh(); will always be inferior to auto Foo::buh() -> Bar<Baz>;

#

just better in every way

wheat haven
#

you're saying that normal return types can't? I do that all the time

severe dome
#

not if you define them outside the class body, no

wheat haven
#

ah that, yeah

#

you gave declarations as examples so I was confused

severe dome
#

I'm on mobile so didn't want to flesh them out :)

wispy spear
#

: )

#

monsieur lustri also added rust like typisms ala "self" too, but im not there yet

severe dome
#

can't use those bc I'm stuck on C++17 :(

primal shadow
#

Ok since I'm going to have to do it 4+ times/view I decided to allocate buffers to hold instance+meshlet ID per cluster in the scene, instead of doing the binary search in every pass

#

Unfortnatently that means another 8 bytes/cluster of memory usage :((

#

100 million clusters = 800 mb 😦

#

In the scene, not rendered

#

And LOD clusters count towards that

#

~6.7ms/frame, but ~2ms of that is CPU time 😦

primal shadow
#

I think if I want to save 8bytes/scene_cluster, I'll need to switch to persistent culling

#

Also save some culling time, probably like ~0.3ms

#

Culling is already extremely fast for me though

glass sphinx
primal shadow
primal shadow
#

So I've been coming back to this. Yes, clipping the triangle vertices is wrong. But why not clip the triangle AABB you use for testing pixels covered by the triangle?

primal shadow
#

Wait, I think Nanite only forces HW rasterization for depth clipping, and not for viewport

#

that makes more sense

wicked notch
#

something's cooking lads

wispy spear
#

pasta water?

wicked notch
#

I cooked that earlier KEKW

#

something else's cooking

wispy spear
#

Nanite 2.0's oven timer just went off

wispy spear
#

lustri my frosch, you are using cmake's pchisms

#

but im a little bamboozled by it

#

you create a pch library so to speak, and in it you have all the stl headers you use across the retina

#

yet you include the pch nowhere, but whatever stl header individually again

#

does "linking with the pch lib" take care of the usual "include <pch.h>"?

frank sail
#

I think the pch is automatically included by everything in the cmake way

wispy spear
#

that means you could remove the individual stl includes

#

neh?

frank sail
#

I think it means those includes are now free

wispy spear
#

at least that was my understanding how cmakes pchisms should work

wicked notch
#

they're redundant yes

wispy spear
#

ah

frank sail
#

I'd keep them, imo it makes things more readable

wispy spear
#

: )

#

that is/was the conchfusing conchfusion

#

thank you my froggies

#

gcc 14.1.1 is out btw

faint crane
#

Someone beat me to it.

finite yacht
wispy spear
#

@wicked notch are you alright

wicked notch
#

it depends

wispy spear
#

i heard italy got hit by earfquakes, around napoli

wicked notch
#

I don't live there KEKW

#

I haven't heard the news

wispy spear
#

ah

#

was worried about your pizza oven 😄

frank sail
#

reminds me of the earthquake that hit where I live, and I didn't feel shit

wicked notch
#

we can make pizza outside of napoli you know

wispy spear
#

😛

#

also last night jaker?

frank sail
#

nah it was months ago

#

now I'm looking for a meme

wispy spear
#

ah

#

: )

frank sail
#

found it

loud crag
#

@primal shadow what gltf extension do you mean exactly?

primal shadow
loud crag
#

ah you said „bevy meshlet extension“ in showcase and that didnt sound like it was just custom for your engine

primal shadow
#

Ah no it's a custom extension, bevy is the name of the engine I work on

faint crane
#

Awesome work. Starting on this and my goal was to get Bistro itself rendering.

#

This is what I'm rolling with. Doesn't eat up 5GB of VRAM so it's a start,.

primal shadow
#

I'm also working on writing up a (very long) blog post detailing how everything works in my meshlet renderer so far, that'll come out sometime in the next month probably

faint crane
#

Nice. Looked over the earlier PRs. Yours is definitely easiest to follow if nothing but a reminder that my occlusion culling doesn't fully work. brianna_pls

#

Seems like an unending problem.

primal shadow
#

Neither does mine, it seems to over-cull on thin planes 😦

#

The screenshots of bistro I took were carefully positioned to avoid showing the bug 😅

faint crane
#

I'll figure something out. Using WebGPU via Chrome so a combination of JS and Rust/WASM.

#

Will at least create new bugs.

primal shadow
#

WebGPU shouldn't really limit you for this. No min sampler mode, but it's not neccesary anyways.

#

It's just a lot of working implementing everything and maing it fast 😅

#

I still need to redo a lot of things

faint crane
#

It's just the browser really. Need to see if any GPU debuggers work.

#

Plan to shoehorn it into WebGL where I'm really limited.

wicked notch
wicked notch
#

because vertex welding is making me sad

primal shadow
#

Ah. Why so?

wheat haven
#

What kind of welding do you mean? Just de-duplicating? Or merging nearby vertices?

wicked notch
primal shadow
#

Yeah UV seams make things really difficult

primal shadow
#

I'm starting on my blog post to accompany the Bevy 0.14 release notes for the new (experimental) meshlet feature. Maybe it's just my writing style, or the amount of info I have to cover and how much I hate writing, but it feels very dry :(. I'm hoping it's at least informative? Idk, only half done, and the asset processing stuff is the boring part and the section I don't have much to talk about, relative to the (not yet written) shader section, where I did a bunch of things differently from Nanite. https://github.com/JMS55/jms55.github.io/blob/a3c272a4563675c6888cfce6565784af1c552b0c/content/posts/2024_05_10_virtual_geometry_bevy_0_14/index.md.

dull oyster
#

I would guess that "dryness" will varies per reader. Some people may prefer a straight-to-the-point article, and it might be easier to read for non native speakers.

Taking myself as an example, I try to write things as I would say them, but I tend to prefer focused articles if I want to learn a technique*. Furthermore, english is not my native language but I found your introduction easy to read and understand, and the other parts are straight-to-the-point (that does not mean they are hard to read nor understand!), as you write it in the article "in favor of just describing the algorithm itself". That's a complete valid reason, you already have a lot to write about!

runic surge
#

Images are great at getting the point across and lets you re-reference them later if you need them

primal shadow
primal shadow
dull oyster
#

If the description was too brief, you can always make a follow up article or updates to your article

primal shadow
#

True

frank sail
#

Now I just gotta wait for the to-do sections hehe

primal shadow
#

Thanks for the feedback! Working on it in parallel with actually writing new code 😅

wispy spear
wicked notch
#

I literally read "John Khronos as a member"

#

my brain is truly fried

wispy spear
#

hehe

#

man they create new shit every week

#

now new PBR tonemapper

#

3d commerce bs

#

anari nonsense

#

whatever that kamaros thing is

finite quartz
primal shadow
#

Terminology and 1 pass of my renderer written about for the blog post - 8 more passes and the future work section left 😅

dull oyster
primal shadow
#

Unreal's solution is better though, I'm going to copy that. It's going to be part of the future work section.

#

Instead of dispatching per cluster, build a BVH over the lod tree, and traverse the tree per instance instead with persistent culling.

dull oyster
primal shadow
dull oyster
#

Oh right

primal shadow
#

Day ? of ??? of writing. Did half a shader, didn't even manage a full pass 😅. Tbf cluster culling is a lot.

primal shadow
#

@wicked notch @dull oyster I've finished writing the first draft of my frame breakdown. Not published to my blog yet, but here's an early access link: https://github.com/JMS55/jms55.github.io/blob/meshlet-wip/content/posts/2024_06_08_virtual_geometry_bevy_0_14/index.md

I'd very much appreciate if you could give it a review! Criticism very much welcome. It's hard to know what other people will intuitively understand or what needs more explanation when writing in general, and this is an extremely technical topic that I tried to avoid spending 10 years writing 😅 .

wicked notch
#

I'll give it a read later tonite

velvet marsh
#

that's pretty incredible, thanks for writing that, that's immensely valuable to learn how you implemented virtual geometry and some of these concepts, I hadn't heard of a depth pyramid before or virtual geometry

dull oyster
# primal shadow <@320895822394818561> <@163242187084005377> I've finished writing the first draf...

First off, congrats on writing about the entire pipeline!

Overall I think the article is very interesting.

I was on mobile so I could not check the code easily but here is my feedback:

  • "In the example with the wall with the rocks and trees behind it, we could see that last frame the wall clusters contributed pixels to the final image, but none of the rock or tree clusters did. “
    Maybe I missed something but I could not see this example?

  • I like the discussion of why a single indirect draw was chosen

  • Writing different vertex indices for the material pass is smart

wispy spear
#

something which could/should be presented at GPC in november 😉

dull oyster
#

November? Not sure I'll be very available 😬

wispy spear
#

calling for papers ends at end of june (can probably be extended)

#

November 12-14, 2024

#

just saying

primal shadow
# dull oyster First off, congrats on writing about the entire pipeline! Overall I think the a...

Maybe I missed something but I could not see this example?
It was two paragraphs ago, the first one in that section
"I mentioned earlier that frustum culling is not sufficent for complex scenes. With meshlets, we're going to have a lot of geometry in view at once. Rendering all of that is way too expensive, and unnecessary. It's a complete waste to spend time rendering a bunch of detailed rocks and trees, only to draw a wall in front of it later on (overdraw)."

Glad you liked it!

primal shadow
dull oyster
wispy spear
velvet marsh
#

keep learning new things on reread, frame budgets!

#

also using previous frame data to do occlusion culling is something I've been reading about elsewhere, it's super interesting, and I have no idea how to do it yet

frank sail
#

didnotread but typically it would be that you produce visibility of objects for the next frame

#

so 1 bit per object that you write to this frame and read from the next frame

faint crane
primal shadow
#

The only thing is it suggests using int mipLevel = floor(log2(max(AABB.pixelWidth, AABB.pixelHeight)));

#

I had bugs with that. I instead used max(0, ceil(log2(max_width_or_height))

#

You also don't have to store an explicit 0/1 visible bit between frames. I think it's easier to use last frame's depth pyramid.

velvet marsh
#

thank you

primal shadow
#

np

wispy spear
#

you could add +1 to the floor one

ebon ruin
#

Who’s thread is this?

wispy spear
#

lvstri's

ebon ruin
#

it looks like he posts 4 times a month lol

faint crane
#

This is the lvstri fan fan club.

delicate rain
#

We all gather in the hopes of LVSTRI showing up

#

Still didn't get my gpu signed by him

wispy spear
#

lustri is also neck deep in uni shit

delicate rain
#

I am officially done today

#

passed the last exam two hours ago 😄

#

Time to get back to the VSM struggle froge_bleak

wispy spear
#

woohoo

velvet marsh
#

Congratulations!

delicate rain
#

just one more semester

#

last one bro I promise

wispy spear
#

hu?

#

you said you are done ange

delicate rain
#

ah sorry, done with this semester

wispy spear
#

: )

delicate rain
#

3 months free-ish time looking for temp job

#

and then one more

wispy spear
#

so they say

wicked notch
#

soon™️

wispy spear
ebon ruin
#

well

#

lvstri follows my thread

#

so ill follow his

#

he seems cool

wispy spear
#

he is

wicked notch
#

welcome to nanite HQ

delicate rain
#

Piece by piece we reassemble unreal engine 5

#

And once we manage it, ue6 will release, featuring a fully ai driven rendering pipeline

#

And we switch to being embedded developers

primal shadow
#

So it was pointed out to me that Tencent made a Nanite-like system on mobile

#

They mention they used 32 bit atomic buffers for software rendering

#

But this has got to be one of the least informative GDC talks I've ever seen. They explain nothing

#

So frustrating. I really want to know how they made 32 bit atomics work

storm wolf
#

i was under the impression that nanite used atomics as well

#

is there another way for software rast?

wicked notch
#

perhaps, it depends on how you want to tune your virtual geometry

#

if you don't need pixel sized triangles then you could entirely skip the soft rast

#

otherwise atomics are kinda necessary because you need to min on depth

storm wolf
#

is there some trick other than min/max and output - i cant imagine it is too complex

wicked notch
#

hmm the objective in the end is to write depth in some way

#

how would one emulate the depth test without min?

frank sail
#

per-pixel linked list of fragment outputs, then sorting

primal shadow
#

It was something like that, I forget the exact details

glass sphinx
#

@primal shadow do you also stream and store partial lods in this impl?

primal shadow