Iris - A Journey through OpenGL and beyond to learn Graphics | Graphics Programming | Page 20

glass sphinx Apr 19, 2024, 10:43 PM

#

i mean just looking at the capture it seems the wrote for indices is 4x the drawtime

primal shadow Apr 19, 2024, 10:43 PM

#

Anyone know where I can upload a 35mb trace?

frank sail Apr 19, 2024, 10:43 PM

#

Discord

wicked notch Apr 19, 2024, 10:43 PM

#

should be fine here no?

#

I think the limit is 100MiB or something

primal shadow Apr 19, 2024, 10:43 PM

#

Oh cool, thought the limit was 25mb

📎 dumb_meshlet_trace.zip

#

it's 50mb

frank sail Apr 19, 2024, 10:44 PM

#

It depends on the server

primal shadow Apr 19, 2024, 10:44 PM

#

glass sphinx i mean just looking at the capture it seems the wrote for indices is 4x the draw...

Idk, when I tried this several months ago it was slower. Let me try again

#

But also, can you explain the workgroup barrier omission first?

glass sphinx Apr 19, 2024, 10:44 PM

#

i think my draw was around 20% slower but most of my meshlets were relatively full

#

but the compute part to write was way faster

frank sail Apr 19, 2024, 10:45 PM

#

primal shadow But also, can you explain the workgroup barrier omission first?

He is trying to make you invoke ub

glass sphinx Apr 19, 2024, 10:45 PM

#

primal shadow But also, can you explain the workgroup barrier omission first?

just to an atomic per thread

primal shadow Apr 19, 2024, 10:46 PM

#

They can't, I'd be growing the count too much. It's important that it's only incremented once.

#

I do have acess to subgroups though btw, as of this week

wicked notch Apr 19, 2024, 10:46 PM

#

primal shadow They can't, I'd be growing the count too much. It's important that it's only inc...

the hw will make sure it's only once

glass sphinx Apr 19, 2024, 10:46 PM

#

instead of vertex count, each thread adds one

primal shadow Apr 19, 2024, 10:46 PM

#

oh

wicked notch Apr 19, 2024, 10:47 PM

#

btw is it normal that your vkCmdDispatch is 232x232x232? bleakekw

glass sphinx Apr 19, 2024, 10:47 PM

#

kekek

primal shadow Apr 19, 2024, 10:47 PM

#

wicked notch btw is it normal that your vkCmdDispatch is 232x232x232? <:bleakekw:108259835030...

I ran out of room in a 1d dispatch so I made it 3d

#

since it's 1 wg per cluster

wicked notch Apr 19, 2024, 10:48 PM

#

huh weren't you doing the two pass kinda thing?

#

which is muuuch better since you avoid that early out

#

~~you also avoid spawning ludicrous amounts of compute invocations~~

primal shadow Apr 19, 2024, 10:49 PM

#

wicked notch huh weren't you doing the two pass kinda thing?

I am. Culling and draw index buffers are seperate.

wicked notch Apr 19, 2024, 10:49 PM

#

wait so you have 232x232x232 surviving meshlets?

#

wot

primal shadow Apr 19, 2024, 10:49 PM

#

No, It's not an indirect draw

wicked notch Apr 19, 2024, 10:49 PM

#

ah

#

then you're definitely wasting time innit

primal shadow Apr 19, 2024, 10:50 PM

#

It's on my list to write a compact buffer of surviving cluster IDs and make it an indirect dispatch, though

#

Or just combine the two passes if I can

#

maybe do some sort of subgroup/workgroup prefix sum to collacese the writes in the culling shader

#

sooo many possible options, tedious to test every one

primal shadow Apr 19, 2024, 10:51 PM

#

glass sphinx instead of vertex count, each thread adds one

// Reserve space in the buffer for this meshlet's triangles
draw_index_buffer_start_workgroup = atomicAdd(&draw_indirect_args.vertex_count, 1u);
draw_index_buffer_start_workgroup /= 3u; // ????

// Each thread writes one triangle of the meshlet to the buffer slice reserved for the meshlet
let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
let meshlet = meshlets[meshlet_id];
if triangle_id < meshlet.triangle_count {
    draw_index_buffer[draw_index_buffer_start_workgroup + triangle_id] = (cluster_id << 8u) | triangle_id;
}

This can't be sane, right? So atomicAdd() is going to return the same value for every thread??

#

uhh, the value before all the adds in the workgroup, or after?

glass sphinx Apr 19, 2024, 10:53 PM

#

wait

#

i might be retarded (i definetly am)

#

i need to go on pc one sec

wicked notch Apr 19, 2024, 10:54 PM

#

primal shadow ```rust // Reserve space in the buffer for this meshlet's triangles draw_index_b...

I think you can just write this like so:

glass sphinx Apr 19, 2024, 10:56 PM

#

primal shadow ```rust // Reserve space in the buffer for this meshlet's triangles draw_index_b...

do

#

you just use the offset every thread gets locally

primal shadow Apr 19, 2024, 10:56 PM

#

These are my shaders btw https://github.com/bevyengine/bevy/tree/0ebd414dbc13ef77b84627ca5d0e82b72cd50262/crates/bevy_pbr/src/meshlet

glass sphinx Apr 19, 2024, 10:56 PM

#

the atomic add retuens a different value for each one

wicked notch Apr 19, 2024, 10:56 PM

#

const bool is_valid_primitive = gl_LocalInvocationID.x < meshlet.primitive_count;
const uint local_offset = subgroupExclusiveAdd(uint(is_valid_primitive));
uint global_offset = 0;
if (subgroupElect()) {
  global_offset = atomicAdd(..., (local_offset + 1) * 3);
}
buffer[global_offset + local_offset] = ...;```

#

I think this is correct?

glass sphinx Apr 19, 2024, 10:57 PM

#

you can remove the local offset and the subgroup elect as well

primal shadow Apr 19, 2024, 10:57 PM

#

oh, huh. Don't actually enforce consecutive locations in the buffer are for the same cluster, ok

glass sphinx Apr 19, 2024, 10:58 PM

#

i mean it will be

#

cause the hw will do it with subgroup ops

primal shadow Apr 19, 2024, 10:58 PM

#

right

glass sphinx Apr 19, 2024, 10:58 PM

#

but yea you can do it like lvstri showed for piece of mind

#

lvstri your forgot a subgeoup broadcast first

#

🫨

wicked notch Apr 19, 2024, 10:59 PM

#

ah you're right

#

it's joever

#

I'm not well versed in subgroup magique

glass sphinx Apr 19, 2024, 10:59 PM

#

subgeoup majonaise is my favourite dressing

primal shadow Apr 19, 2024, 11:01 PM

#

@compute
@workgroup_size(64, 1, 1) // 64 threads per workgroup, 1 workgroup per cluster, 1 thread per triangle
fn write_index_buffer(@builtin(workgroup_id) workgroup_id: vec3<u32>, @builtin(num_workgroups) num_workgroups: vec3<u32>, @builtin(local_invocation_index) triangle_id: u32) {
    // Calculate the cluster ID for this workgroup
    let cluster_id = dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
    if cluster_id >= arrayLength(&meshlet_thread_meshlet_ids) { return; }

    // If the cluster was culled or already drawn in the first pass, then we don't need to draw it
    if !meshlet_should_draw(cluster_id) { return; }

    // Exit this thread if the triangle slot is empty in the meshlet
    let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
    let meshlet = meshlets[meshlet_id];
    if triangle_id >= meshlet.triangle_count { return; }

    // Write cluster ID + triangle ID
    let i = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u) / 3u;
    draw_index_buffer[i] = (cluster_id << 8u) | triangle_id;
}

Yeah?

#

What's the subgroup stuff for in yours?

#

uh oh, something is broken

#

oh whoops

#

Should be let i = atomicAdd(&draw_indirect_args.vertex_count, 3u) / 3u;

glass sphinx Apr 19, 2024, 11:10 PM

#

you should add 1

#

not tri count

#

or 3

#

yes 3

#

and then div by three

primal shadow Apr 19, 2024, 11:11 PM

#

Ok I mean it renders fine, but it appears to be exactly the same perf...

glass sphinx Apr 19, 2024, 11:11 PM

#

when you add 3?

primal shadow Apr 19, 2024, 11:11 PM

#

Code is a bit simpler though so that's cool atl

glass sphinx Apr 19, 2024, 11:11 PM

#

yea then its something else slowing it down

primal shadow Apr 19, 2024, 11:12 PM

#

Why am I only at 20% occupancy :/

wicked notch Apr 19, 2024, 11:13 PM

#

send spirv I guess KEKW

#

maybe wgsl compiler is generating absolute retard code

#

but I doubt

glass sphinx Apr 19, 2024, 11:14 PM

#

i am downloading nsight nr

#

rn

#

i only have my laptop as im on a worktrip

#

also l´slow hotel internet

primal shadow Apr 19, 2024, 11:14 PM

#

NSight also says this I noticed

#

Is that just because I have perfect occlusion culling lol

#

And it think it means culling is doing nothing?

glass sphinx Apr 19, 2024, 11:15 PM

#

no it says its disabled

#

thats really strange

#

do you use discard_

primal shadow Apr 19, 2024, 11:16 PM

#

What does it mean?

glass sphinx Apr 19, 2024, 11:16 PM

#

?

primal shadow Apr 19, 2024, 11:16 PM

#

glass sphinx do you use discard_

in the shader? nope

glass sphinx Apr 19, 2024, 11:16 PM

#

show frag shader

primal shadow Apr 19, 2024, 11:16 PM

#

https://github.com/bevyengine/bevy/blob/0ebd414dbc13ef77b84627ca5d0e82b72cd50262/crates/bevy_pbr/src/meshlet/visibility_buffer_raster.wgsl

glass sphinx Apr 19, 2024, 11:16 PM

#

nsight is expecting a 3.3x speedup with early z enabled damn

primal shadow Apr 19, 2024, 11:16 PM

#

It's for the shadow view only, the main view dosen't have this issue

glass sphinx Apr 19, 2024, 11:17 PM

#

ah you write frag deprth

primal shadow Apr 19, 2024, 11:17 PM

#

Right, it's rasterizing depth

#

Eventually when wgpu gets image atomics, I'm just going to use that

glass sphinx Apr 19, 2024, 11:18 PM

#

im confused

#

why do you need to write it like that

primal shadow Apr 19, 2024, 11:18 PM

#

Yeah this bit:

#ifdef DEPTH_CLAMP_ORTHO
@fragment
fn fragment(vertex_output: VertexOutput) -> @builtin(frag_depth) f32 {
    return vertex_output.unclamped_clip_depth;
}
#endif

wicked notch Apr 19, 2024, 11:18 PM

#

you could just enable depth clamp in your pipeline state

glass sphinx Apr 19, 2024, 11:18 PM

#

your frag shader could be just empty no_

#

?

primal shadow Apr 19, 2024, 11:18 PM

#

This is from somewhere else in bevy, let me figure out why that has to be a thing

glass sphinx Apr 19, 2024, 11:18 PM

#

or write a color attachment if its special

#

no its not

#

this will write the depth buffer

#

it will disable early z

#

as you could change the value to be smaller then it is when early z runs invalidating it

#

bleakekw bevy going hard i see i see

#

i think you can just make the frag shader empty

#

return nothing

primal shadow Apr 19, 2024, 11:21 PM

#

Ok I can't even find a PR where DEPTH_CLAMP_ORTHO was added

#

wtf is it doing?

wicked notch Apr 19, 2024, 11:21 PM

#

depth clamping is necesary for some shadow mapping techniques

#

but it's an option in your rasterizer state that you can just enable, there's no reason to do it yourself

primal shadow Apr 19, 2024, 11:22 PM

#

I feeeelllll like we tried that, hrm

#

let me try it though

glass sphinx Apr 19, 2024, 11:23 PM

#

btw what gpu do you have_

#

?

#

i dont get how the occupancy is so low

primal shadow Apr 19, 2024, 11:24 PM

#

Rtx 3080

primal shadow Apr 19, 2024, 11:24 PM

#

wicked notch but it's an option in your rasterizer state that you can just enable, there's no...

am I dumb, I don't see it?

wicked notch Apr 19, 2024, 11:25 PM

#

primal shadow am I dumb, I don't see it?

https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPipelineRasterizationStateCreateInfo.html

primal shadow Apr 19, 2024, 11:25 PM

#

https://paste.rs/4qkiq.rs

Source Code | 4qkiq | Rocket Powered Pastebin

a simple, no-frills, command-line driven
pastebin service powered by the Rocket web framework.

buoyant summit Apr 19, 2024, 11:25 PM

#

vkCmdSetDepthClampEnableEXT

primal shadow Apr 19, 2024, 11:25 PM

#

I don't think wgpu exposes it xD

wicked notch Apr 19, 2024, 11:25 PM

#

bruh

buoyant summit Apr 19, 2024, 11:26 PM

#

wicked notch bruh

https://github.com/gpuweb/gpuweb/issues/2100

primal shadow Apr 19, 2024, 11:26 PM

#

https://github.com/gfx-rs/wgpu/issues/3900

wicked notch Apr 19, 2024, 11:27 PM

#

When you set DepthClipEnable to FALSE, the hardware skips the z clipping (that is, the last step in the preceding algorithm). However, the hardware still performs the "0 < w" clipping. When z clipping is disabled, improper depth ordering at the pixel level might result. However, when z clipping is disabled, stencil shadow implementations are simplified. In other words, you can avoid complex special-case handling for geometry that goes beyond the back clipping plane

#

~~how does anyone unironically use D3D12~~

glass sphinx Apr 19, 2024, 11:28 PM

#

you will loose a ton of perf if you can not do early z

#

but you can force it on in most apis

primal shadow Apr 19, 2024, 11:29 PM

#

I'll add it to my list of "things that will be faster when X wgpu issue is closed" :/

glass sphinx Apr 19, 2024, 11:29 PM

#

LGPU

buoyant summit Apr 19, 2024, 11:29 PM

#

wicked notch ```When you set DepthClipEnable to FALSE, the hardware skips the z clipping (tha...

ms is meming me or vk spec is

wicked notch Apr 19, 2024, 11:30 PM

#

I think MS is meming

buoyant summit Apr 19, 2024, 11:30 PM

#

I don't see vk spec talk about w, it seems to be operating after / w

#

so it's talking about clamping z/w

wicked notch Apr 19, 2024, 11:30 PM

#

yes

#

as it should be

#

idk what the hell D3D12 wants to accomplish here

buoyant summit Apr 19, 2024, 11:30 PM

#

what this tells me is that we don't have enough CTS in this area

#

in vk

#

or I'm misunderstanding something

primal shadow Apr 19, 2024, 11:31 PM

#

This shouldn't matter once I add software raster anyways, as I'll be directly blitting the depth in the fragment shader

#

Ok, so, raster is not going to get any faster anytime soon until I get software raster or mesh shaders

#

Back to writing out an index buffer

wicked notch Apr 19, 2024, 11:33 PM

#

did you profile your shader yet

primal shadow Apr 19, 2024, 11:33 PM

#

Oh no, let me go do that

#

I got distracted by the zcull lmao

buoyant summit Apr 19, 2024, 11:34 PM

#

wicked notch idk what the hell D3D12 wants to accomplish here

https://github.com/gpuweb/gpuweb/issues/2100#issuecomment-923248554

#

from ms person

primal shadow Apr 19, 2024, 11:34 PM

#

buoyant summit <https://github.com/gpuweb/gpuweb/issues/2100#issuecomment-923248554>

So it osunds like on directx we can remove the fragment shader workaround?

buoyant summit Apr 19, 2024, 11:36 PM

#

idk what it sounds like

glass sphinx Apr 19, 2024, 11:36 PM

#

jasmin could you try increasing the wg size for me in the write index buffer shader?

#

btw you hsoul default to 128 or 256 size wgs

buoyant summit Apr 19, 2024, 11:36 PM

#

primal shadow So it osunds like on directx we can remove the fragment shader workaround?

if your intent is to clamp things no I don't believe you can

wicked notch Apr 19, 2024, 11:36 PM

#

buoyant summit <https://github.com/gpuweb/gpuweb/issues/2100#issuecomment-923248554>

ok this makes more sense

primal shadow Apr 19, 2024, 11:36 PM

#

Idk I'm lost by it lol

glass sphinx Apr 19, 2024, 11:36 PM

#

the 3080 cant reach peak occupancy for workgroups smaller then 96 threads but with 128 it can reach max launch speeds for warps as well

primal shadow Apr 19, 2024, 11:36 PM

#

glass sphinx the 3080 cant reach peak occupancy for workgroups smaller then 96 threads but wi...

What about AMD?

glass sphinx Apr 19, 2024, 11:37 PM

#

amd also preferes the same numbers per default

#

64 is good for time - diverget work

primal shadow Apr 19, 2024, 11:37 PM

#

ok, I'll switch culling to 128, that's easy

glass sphinx Apr 19, 2024, 11:37 PM

#

but for things like this that finishes fast its better to have larger groups

primal shadow Apr 19, 2024, 11:37 PM

#

glass sphinx 64 is good for time - diverget work

well culling is divergent

#

so maybe I'll keep it 64 lol

glass sphinx Apr 19, 2024, 11:38 PM

#

how is cullign divergent

#

its not really

primal shadow Apr 19, 2024, 11:38 PM

#

1 thread per cluster

#

some early exit on frustum culling or w/e

glass sphinx Apr 19, 2024, 11:38 PM

#

with divergent i mean massive differences in runtime between threads like 10x

primal shadow Apr 19, 2024, 11:38 PM

#

oh ok

glass sphinx Apr 19, 2024, 11:38 PM

#

for example compute raytracing for ssr or ssao

#

thats mega divergent

#

culling should be pretty cozy

primal shadow Apr 19, 2024, 11:39 PM

#

Ah, thank you. Yay for learning.

wicked notch Apr 19, 2024, 11:39 PM

#

you retire the entire workgroup at once with culling

#

zero divergence frogapprove

primal shadow Apr 19, 2024, 11:39 PM

#

Ok so, profile write_index_buffer, then try increasing the thread count

glass sphinx Apr 19, 2024, 11:39 PM

#

somehow the compute job acts as tho you have workgroups the size of 32 or so

#

in nsight it looks like that at least

primal shadow Apr 19, 2024, 11:39 PM

#

wicked notch you retire the entire workgroup at once with culling

No, 1 cluster per thread, so some may get to occlusionc ulling, others may get frustum or lod culled and exit before that

wicked notch Apr 19, 2024, 11:40 PM

#

I mean the write index buffer

#

my bad

#

culling speed is not an issue is it

#

even if you diverge like that it's pretty minor

primal shadow Apr 19, 2024, 11:41 PM

#

yeah cullign speed is like 0.3ms on this insane cluster count

#

it's nothing

#

raster + write index buffer + per frame data uploads are the issue (ignoring shading, currently broken, and inefficent anyways :P)

glass sphinx Apr 19, 2024, 11:44 PM

#

this is why we love mesh shaders

#

and soon work graphs with mesh nodes

#

warp drive

primal shadow Apr 19, 2024, 11:44 PM

#

yeahhhh

#

not going to get that for a looong time

wicked notch Apr 19, 2024, 11:44 PM

#

glass sphinx and soon work graphs with mesh nodes

gpu goes overdrive mode

#

froge_love

glass sphinx Apr 19, 2024, 11:47 PM

#

my fingers are itching

#

jasmine can you run with 128 wg size

#

https://tenor.com/view/puppy-eyes-sorry-gif-13770628

Tenor

primal shadow Apr 19, 2024, 11:48 PM

#

It takes a good 5m per trace because there's so much data lol, give me a few mins 😅

#

I'll have to switch to 1 frame per trace, even 2 takes forever

glass sphinx Apr 19, 2024, 11:48 PM

#

oki

primal shadow Apr 19, 2024, 11:49 PM

#

#

ALU bound??

glass sphinx Apr 19, 2024, 11:49 PM

#

btw @primal shadow its ub to return from a shader before entering barriers

#

all threads must enter the same barriers

primal shadow Apr 19, 2024, 11:49 PM

#

whuh

glass sphinx Apr 19, 2024, 11:49 PM

#

or does wgpu do some magic?

primal shadow Apr 19, 2024, 11:49 PM

#

I assume wgpu is protecting me here

#

Cause I don't get any errors from them

#

Actually no, meshlets don't use barriers anymore

glass sphinx Apr 19, 2024, 11:50 PM

#

yea it will work

primal shadow Apr 19, 2024, 11:50 PM

#

Except for downsample depth hiz

glass sphinx Apr 19, 2024, 11:50 PM

#

most of the time

primal shadow Apr 19, 2024, 11:50 PM

#

when I just copied from SPD

glass sphinx Apr 19, 2024, 11:50 PM

#

but then you wll die randomly

wicked notch Apr 19, 2024, 11:50 PM

#

primal shadow I assume wgpu is protecting me here

can wgpu even detech such cases?

primal shadow Apr 19, 2024, 11:51 PM

#

Idk how to export the shader, so let me upload the trace again ig

glass sphinx Apr 19, 2024, 11:51 PM

#

yea that would be sus as then shading langs should do that also

glass sphinx Apr 19, 2024, 11:51 PM

#

primal shadow

susus

#

how did it nt change

#

very strange

primal shadow Apr 19, 2024, 11:51 PM

#

this is with 64 wg still

buoyant summit Apr 19, 2024, 11:51 PM

#

wicked notch can wgpu even detech such cases?

no

glass sphinx Apr 19, 2024, 11:52 PM

#

primal shadow this is with 64 wg still

ah i see

buoyant summit Apr 19, 2024, 11:52 PM

#

webgpu also leaves that to be ub

left jacinth Apr 19, 2024, 11:52 PM

#

good

glass sphinx Apr 19, 2024, 11:52 PM

#

no safe in gpu land

wicked notch Apr 19, 2024, 11:52 PM

#

safety memes don't work in gpu land

glass sphinx Apr 19, 2024, 11:52 PM

#

i connected to gabes neuron to help

wicked notch Apr 19, 2024, 11:52 PM

#

checkmate

primal shadow Apr 19, 2024, 11:52 PM

#

I wish webgpu was more like wgpu_hal sometimes :/

#

I am 4mb over the limit even when compressed ahhh

buoyant summit Apr 19, 2024, 11:53 PM

#

nvm I was overthinking it

primal shadow Apr 19, 2024, 11:53 PM

#

anyways this is the only expensive part of the shader

buoyant summit Apr 19, 2024, 11:53 PM

#

but ye webgpu does not protect you from ub

#

you will get ub

#

it's extremely infectious

primal shadow Apr 19, 2024, 11:53 PM

#

and a little here

glass sphinx Apr 19, 2024, 11:53 PM

#

ok i have a suspicion

buoyant summit Apr 19, 2024, 11:53 PM

#

reminder that lgsb means waiting for global memory op

glass sphinx Apr 19, 2024, 11:53 PM

#

does it run with a workgroup size of 32?

#

and loops?

#

can zou see that in the spirv?

primal shadow Apr 19, 2024, 11:54 PM

#

buoyant summit reminder that lgsb means waiting for global memory op

reads or writes?

glass sphinx Apr 19, 2024, 11:54 PM

#

buoyant summit reminder that lgsb means waiting for global memory op

what i dont get are the unused warp slots

left jacinth Apr 19, 2024, 11:54 PM

#

long scoreboard (waiting on a read) doesn't have anything to do with why the workgroup is size 32

buoyant summit Apr 19, 2024, 11:54 PM

#

primal shadow reads or writes?

either

#

it can be texture samples, loads, stores, atomics, whatever, but operating on global memory (or scratch/private/stack/whatever too)

left jacinth Apr 19, 2024, 11:54 PM

#

(presumably is size 32 with a loop, because there's no other explanation)

glass sphinx Apr 19, 2024, 11:55 PM

#

jasmine in the spv at the top can you give us the workgroup sizes?

primal shadow Apr 19, 2024, 11:55 PM

#

📎 write_index_buffer.csv

primal shadow Apr 19, 2024, 11:55 PM

#

glass sphinx jasmine in the spv at the top can you give us the workgroup sizes?

it dosen't say?

buoyant summit Apr 19, 2024, 11:57 PM

#

11, OpExecutionMode %write_index_buffer LocalSize 64 1 1 ; 0x00000098,,,,,0,,,

#

64x1x1

glass sphinx Apr 19, 2024, 11:57 PM

#

yea

#

hmmm

#

i wonder if 128 size changes anything

primal shadow Apr 19, 2024, 11:57 PM

#

let me go try

#

2 clusters per wg now, give me a few minutes to rewrite things

glass sphinx Apr 19, 2024, 11:58 PM

#

ok

#

i suspect its launch bound, so the gp cant launch new workgroups fast enough and the warps retire so fast that the gpu cant go full saturation

#

so making the wg bigger should help a lot then

primal shadow Apr 19, 2024, 11:59 PM

#

nsight won't tell me that?

glass sphinx Apr 19, 2024, 11:59 PM

#

at least to increase occupancy

#

i think it did say that at some point but they removedthe metric

#

they changed it a lot last year very annoying

primal shadow Apr 19, 2024, 11:59 PM

#

glass sphinx i suspect its launch bound, so the gp cant launch new workgroups fast enough and...

probably because I spawn a bunch of dummy workgroups that early exit as soon as it checks the cluster is culled...

#

Changing it to an indirect dispatch over surviving meshlets only should help I believe

#

or ideally just combine the pass into the culling pass

wicked notch Apr 20, 2024, 12:00 AM

#

wicked notch ~~you also avoid spawning ludicrous amounts of compute invocations <:bleakekw:10...

if only

#

KEKW

primal shadow Apr 20, 2024, 12:01 AM

#

Write index buffer is wayyyy too expensive. Ideas/options:
* Increase WG thread size
* Write out a buffer of surviving cluster IDs, using an indirect dispatch
* Combine culling and writing index buffer, somehow
* Use multi draw indirect, 1 IndirectDrawArgs per cluster
* Use a fixed size 64 * surviving_cluster_count draw, write out a buffer of surviving cluster IDs, and then have the vertex shader write NaN for excess invocations

left jacinth Apr 20, 2024, 12:01 AM

#

launch bound makes sense in that case I believe. I'd try looping to handle multiple things in 1 compute thread

#

but use 128 wg size either way

#

horrid ass 30 series cards

glass sphinx Apr 20, 2024, 12:02 AM

#

lol

#

yea 128-256 is a good default for things like this

primal shadow Apr 20, 2024, 12:02 AM

#

left jacinth horrid ass 30 series cards

does 40xx improve things?

glass sphinx Apr 20, 2024, 12:02 AM

#

all other nv and amd gpus

primal shadow Apr 20, 2024, 12:02 AM

#

left jacinth launch bound makes sense in that case I believe. I'd try looping to handle multi...

loop, or just use bigger WGs?

glass sphinx Apr 20, 2024, 12:02 AM

#

have a ratio of 2:1 for warp to workgropup slots on each sm

#

only the 30 series has a ratio of 3:1

#

they did that so that they can have more registers for raytracing for each thread but it turned out to bea shit tradeoff

left jacinth Apr 20, 2024, 12:03 AM

#

primal shadow loop, or just use bigger WGs?

both

wicked notch Apr 20, 2024, 12:03 AM

#

glass sphinx only the 30 series has a ratio of 3:1

dumb nv

left jacinth Apr 20, 2024, 12:03 AM

#

primal shadow does 40xx improve things?

every nv generation except 30xx does not have this horrid cripple ass limitation

#

I daily drive a 3070 🤡 🔫

glass sphinx Apr 20, 2024, 12:04 AM

#

to be clear tho, having larger wg sizes for these workloads is better on all gpus

wicked notch Apr 20, 2024, 12:04 AM

#

me when I have a 3070 too

glass sphinx Apr 20, 2024, 12:04 AM

#

its jsut that the 30 series can never ever reach full occcupancy at all with 64

primal shadow Apr 20, 2024, 12:04 AM

#

left jacinth both

Write surviving cluster buffer + indirect dispatch to get rid of "spawn 1 wg per cluster, just exist immediatly if culled" is also an option

#

Hard to say 😅

wicked notch Apr 20, 2024, 12:04 AM

#

https://tenor.com/view/rip-juice-cry-cat-kitten-tears-gif-15751834

Tenor

glass sphinx Apr 20, 2024, 12:04 AM

#

wicked notch me when I have a 3070 too

128 workgroups

#

i will check your shaders tomorrow

wicked notch Apr 20, 2024, 12:05 AM

#

I use 256

left jacinth Apr 20, 2024, 12:05 AM

#

just cope

glass sphinx Apr 20, 2024, 12:05 AM

#

nice

buoyant summit Apr 20, 2024, 12:06 AM

#

left jacinth I daily drive a 3070 🤡 🔫

I daily drive intel hd graphics

#

this bad boy has 2 MMUs

#

does your gpu have 2 MMUs?

#

that's right, it only has one

wicked notch Apr 20, 2024, 12:07 AM

#

nano can unlock the true potential of VSM

#

it's called vkQueueBindSparse

buoyant summit Apr 20, 2024, 12:08 AM

#

honestly idk how the fuck queue bind performance still sucks after 13 years first gpus shipped with this feature

#

on windows idk what's the problem but elsewhere certain drivers oversync to hell

glass sphinx Apr 20, 2024, 12:12 AM

#

its very sad

primal shadow Apr 20, 2024, 12:15 AM

#

hazah, write_index_buffer is now ~5.4ms using wg=128, 2x as fast!

#

....still very slow 😅

#

256 time now

glass sphinx Apr 20, 2024, 12:16 AM

#

niiice

#

in that case it really is bound by launch speed and the used warp slots

left jacinth Apr 20, 2024, 12:16 AM

#

launch bound 💀

#

https://tenor.com/view/h3-ethan-klein-drunk-ethan-klein-drunk-h3-ethan-h3-drunk-gif-2637047668041373838

Tenor

primal shadow Apr 20, 2024, 12:19 AM

#

I did the math here correct, right?

var cluster_id = 4u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
let triangle_id = local_invocation_index % 64u;
cluster_id += local_invocation_index / 64u;

#

And is the compiler smart enough to combine the % and /?

#

4 clusters per workgroup

#

64 triangles per cluster

wicked notch Apr 20, 2024, 12:21 AM

#

you could easily just do >> 6 and & 0x3f if you don't trust your compiler

glass sphinx Apr 20, 2024, 12:21 AM

#

yes the compiler will find that

left jacinth Apr 20, 2024, 12:21 AM

#

the compiler will ABSOLUTELY do that for you

#

so long as you're dividing by a compile-time constant

glass sphinx Apr 20, 2024, 12:22 AM

#

if its not you die by stroke

left jacinth Apr 20, 2024, 12:22 AM

#

if it doesn't do it for you, abandon wgsl. Don't care.

glass sphinx Apr 20, 2024, 12:22 AM

#

(the divisor is not constant, you will be asleeped by the compiler forever)

glass sphinx Apr 20, 2024, 12:22 AM

#

left jacinth if it doesn't do it for you, abandon wgsl. Don't care.

driver compiler will do it

buoyant summit Apr 20, 2024, 12:24 AM

#

the other day I heard about a d3d11 game doing 1D dispatches and then in the shader it was translating that to 2D coordinate by doing stuff like % thingy.GetDimensions().x etc

glass sphinx Apr 20, 2024, 12:24 AM

#

primal shadow I did the math here correct, right? ```rust var cluster_id = 4u * dot(workgroup_...

you could do 64x4x1

#

then the math will be simpler

primal shadow Apr 20, 2024, 12:24 AM

#

oh, true lol

buoyant summit Apr 20, 2024, 12:24 AM

#

buoyant summit the other day I heard about a d3d11 game doing 1D dispatches and then in the sha...

where reactions

#

they are literally doing int div by a non-constant divisor

glass sphinx Apr 20, 2024, 12:25 AM

#

they are amazing i will hug them

wicked notch Apr 20, 2024, 12:25 AM

#

potrick is MS spy

buoyant summit Apr 20, 2024, 12:25 AM

#

hug of death

primal shadow Apr 20, 2024, 12:25 AM

#

Slightly faster with 256 WGs, 4.3ms now

glass sphinx Apr 20, 2024, 12:25 AM

#

niiice

#

no more launch bound

primal shadow Apr 20, 2024, 12:26 AM

#

What am I bound by now? Hard to tell

#

it still shows LGSB on bitfield extract

wicked notch Apr 20, 2024, 12:26 AM

#

primal shadow What am I bound by now? Hard to tell

wasted invocations and mem

primal shadow Apr 20, 2024, 12:27 AM

#

it's also still the same speed in both passes

#

so yeah, wasted invocations...

glass sphinx Apr 20, 2024, 12:27 AM

#

primal shadow What am I bound by now? Hard to tell

probably memory latency

#

wated invocations are ok

#

as long as its a good tradeoff compared to other options

primal shadow Apr 20, 2024, 12:28 AM

#

Memory latency, where? The atomic write? the looking up meshlet data?

primal shadow Apr 20, 2024, 12:30 AM

#

glass sphinx as long as its a good tradeoff compared to other options

Multi draw indirect is too slow, bottlenecks at command processing, single draw indirect is neccesary
Using fixed-size increments of 64 vertices and outputting NaN for excess triangles slows the draws down by a decent amount, but it's an option
Somehow combining culling and write index buffer shaders are an option, but difficult to get working well, as each thread is writing a different amount of vertices

glass sphinx Apr 20, 2024, 12:30 AM

#

atomic and writing yea

left jacinth Apr 20, 2024, 12:30 AM

#

primal shadow * Multi draw indirect is too slow, bottlenecks at command processing, single dra...

mesh shaders

primal shadow Apr 20, 2024, 12:30 AM

#

left jacinth mesh shaders

I don't have atm :/

left jacinth Apr 20, 2024, 12:31 AM

#

web is useless

#

don't care

primal shadow Apr 20, 2024, 12:31 AM

#

I think my best bet is either combine the shaders if I can, or make write_index_buffer an indirect dispatch

left jacinth Apr 20, 2024, 12:31 AM

#

garbage nonsense limitations don't allow your shaders to be limited just because Apple exists

#

or cope that that's why it's so slow

primal shadow Apr 20, 2024, 12:32 AM

#

left jacinth web is useless

I need to use wgpu regardless as the rest of the engine uses it

left jacinth Apr 20, 2024, 12:32 AM

#

yeah I'm just raging

buoyant summit Apr 20, 2024, 12:32 AM

#

afaiu you can just throw wgpu out of bevy?

#

just not use it

#

you mean you're working on the engine

primal shadow Apr 20, 2024, 12:33 AM

#

buoyant summit you mean you're working on the engine

Yes. I mean I could use vulkan directly, but I would lose way too much, and it'd have very little chance of getting upstreamed into bevy,

left jacinth Apr 20, 2024, 12:36 AM

#

I don't get how people aren't more upset about the limitations on web

#

like you are trying to make a nanite-like renderer

#

you absolutely need to utilize the hardware as much as possible for that to even be viable

#

mesh shader extensions were added to Vulkan SIX years ago

#

that's 2018. Vulkan was initially released in 2016... Mesh shaders are almost as old as Vulkan is 💀

#

but yeah web has no support 😐

#

patrick no

glass sphinx Apr 20, 2024, 12:42 AM

#

🤓 👆 the cross platform extension came 2022

left jacinth Apr 20, 2024, 12:43 AM

#

*for vulkan, yeah

glass sphinx Apr 20, 2024, 12:44 AM

#

web is always needs to support all the bad old hw

buoyant summit Apr 20, 2024, 12:44 AM

#

wgpu isn't restricted to web, they have some exts in there not available on "the web"

#

but yeah it seems like it's another layer of bikeshed

glass sphinx Apr 20, 2024, 12:45 AM

#

yes

primal shadow Apr 20, 2024, 1:11 AM

#

Fixed large bug, down to this now

glass sphinx Apr 20, 2024, 1:12 AM

#

nice

#

still quite heavy

primal shadow Apr 20, 2024, 1:12 AM

#

mhmm, write_index_buffer is still chunky due to all the extra dead workgroups

left jacinth Apr 20, 2024, 1:12 AM

#

Very well done, good job!

primal shadow Apr 20, 2024, 1:12 AM

#

and the 8ms 2x copy at the start is due to lack of ReBAR support

glass sphinx Apr 20, 2024, 1:12 AM

#

wait

#

is the write index buffer not indirect_

#

oooh i see

primal shadow Apr 20, 2024, 1:13 AM

#

nope

glass sphinx Apr 20, 2024, 1:13 AM

#

yea hahaha

#

ok i think that will make it much much faster

primal shadow Apr 20, 2024, 1:13 AM

#

Right, but now the culling pass needs to output a buffer of surviving clusters 🙂

#

and aotmic increment an indirect dispatch

#

I'm going to be sneaky and reuse the index/triangle/primitive/thing buffer for the surviving cluster buffer though

#

Maybe, idk

glass sphinx Apr 20, 2024, 1:14 AM

#

👍

primal shadow Apr 20, 2024, 2:32 AM

#

Is there a way I can avoid the large data upload at the start of the frame? For each cluster in the scene (there might be millions), upload 2 u32s (meshlet ID, and instance ID). Is there a way I can make this persistent, maybe? Idk.

#

Instance ID tells the cluster what instance/entity transform to read

#

Meshlet ID tells the cluster what meshlet asset it's an instance of

primal shadow Apr 20, 2024, 3:23 AM

#

No matter what I do, I'm going to need more buffers 😭

#

Buffers for telling software raster what cluster to use, and same for hardware now

#

I mean 4 bytes per cluster is not much, but dynamically managing the buffer size is sucky

wispy spear Apr 20, 2024, 10:50 AM

#

left jacinth web is useless

thank god im not alone 🙂 +1 for all the things you said

glass sphinx Apr 20, 2024, 11:19 AM

#

primal shadow I mean 4 bytes per cluster is not much, but dynamically managing the buffer size...

its a lot actually

#

its way too big considering pcie speeds

#

you should have a list of meshlets per mesh, a list of meshes per entity and a list of entities.
Then expand from entity, to meshlist to meshes to meshlets

#

uploading 8 bytes per meshlet a frame will never scale its way too much data

glass sphinx Apr 20, 2024, 11:23 AM

#

primal shadow Fixed large bug, down to this now

btw it looks like every single command has its own command buffer for some reason

#

didnt you have that already and nanite like lod selection? maybe im confusing projects

#

wait we are in lvstris channel the whole time bleakekw

wispy spear Apr 20, 2024, 11:32 AM

#

https://tenor.com/view/always-has-been-among-us-astronaut-space-betrayal-gif-23836476

Tenor

left jacinth Apr 20, 2024, 2:50 PM

#

lmao

#

is that deccer from DeccerCubes

glass sphinx Apr 20, 2024, 2:53 PM

#

yes

#

lmao

left jacinth Apr 20, 2024, 2:57 PM

#

they named a user after DeccerCubes

glass sphinx Apr 20, 2024, 2:59 PM

#

they named a cartoon after a shadow artefact

primal shadow Apr 20, 2024, 3:10 PM

#

glass sphinx btw it looks like every single command has its own command buffer for some reaso...

Wgpu does that, apparently... Not sure how much perf that costs me, but

glass sphinx Apr 20, 2024, 3:11 PM

#

bleakekw

primal shadow Apr 20, 2024, 3:11 PM

#

glass sphinx uploading 8 bytes per meshlet a frame will never scale its way too much data

I know :(. But I dispatch basically over a flat list of meshlets. How else does each culling thread know what to operate on? I can't figure out how to avoid it...

glass sphinx Apr 20, 2024, 3:12 PM

#

for you its probably not too much

primal shadow Apr 20, 2024, 3:12 PM

#

glass sphinx didnt you have that already and nanite like lod selection? maybe im confusing pr...

Yeah I have runtime nanite like lod selection

glass sphinx Apr 20, 2024, 3:13 PM

#

there are multiple ways to do this

#

the simplest is prefix sum + binary search

#

https://gist.github.com/Ipotrick/35dcd1f5f96b000e85f4f22d273ec677

Gist

power of two indirect argument buffers

power of two indirect argument buffers. GitHub Gist: instantly share code, notes, and snippets.

#

in this gist i link example code for binary search + prefix sum

primal shadow Apr 20, 2024, 3:14 PM

#

glass sphinx the simplest is prefix sum + binary search

Ok lol I did think of that actually, but it seemed like it would be slow to have every thread of every meshlet pass do a whole binary search 😅.

primal shadow Apr 20, 2024, 3:15 PM

#

glass sphinx https://gist.github.com/Ipotrick/35dcd1f5f96b000e85f4f22d273ec677

Oh nice this describes exactly my problem lol. Let me give this a read.

buoyant summit Apr 20, 2024, 3:15 PM

#

glass sphinx btw it looks like every single command has its own command buffer for some reaso...

every single dispatch or something

glass sphinx Apr 20, 2024, 3:16 PM

#

primal shadow Ok lol I did think of that actually, but it seemed like it would be slow to have...

it has some oberhead yes but its ok

#

its much better then uploading all from cou bleakekw

buoyant summit Apr 20, 2024, 3:16 PM

#

wgpu sync moment

glass sphinx Apr 20, 2024, 3:16 PM

#

you can go hard and do the power of two argument buffers

#

thats very fast

primal shadow Apr 20, 2024, 3:16 PM

#

buoyant summit wgpu sync moment

Yeah it's inefficient tracking. There's an issue open for redoing compute shader tracking at some point to be more fine grained.

glass sphinx Apr 20, 2024, 3:17 PM

#

wgpu also spamms hash maps

buoyant summit Apr 20, 2024, 3:17 PM

#

~~the correct answer is not tracking~~

glass sphinx Apr 20, 2024, 3:17 PM

#

and lazyliy caches so much

#

its really bad imo

#

dont like

buoyant summit Apr 20, 2024, 3:17 PM

#

ye

primal shadow Apr 20, 2024, 3:25 PM

#

glass sphinx you can go hard and do the power of two argument buffers

I read the whole thing, didn't quite understand the last solution 🤔. Mind explaining it again?

#

The power of 2 argument buffers

glass sphinx Apr 20, 2024, 3:26 PM

#

one sec

#

the idea is

#

for example

#

meshlet cull shader feeds meshlet draw shader

#

i will answer soon

primal shadow Apr 20, 2024, 3:30 PM

#

No rush, thanks

glass sphinx Apr 20, 2024, 3:33 PM

#

sorry other way

#

so mesh cull shader, each thread does one mesh

#

expands to meshlet cull shader

#

each thread does a meshlet

#

the eypansion between them is asymmetrical

#

as meshes have differing amounts of meshlets

#

the idea now is

#

to use a property of binary numbers

#

in the mesh culk you take the bumber of meshlets, and write one of 32 buffers, for each bit set in the meshlet count number

#

this way, instead of writing n values where n is meshlet count

#

you write up to 32 times, one time per set bit

#

then these 32 buffers contain a aegument each

#

deoending on their "power" each elemnt desceibes 2^n work items

#

so then

#

you do 32 dispatch indirect

#

for each arg buffer

#

for the meshlet cull shader

#

and each thread can simply bitshift their id by the power

#

to get the argument

#

its a little involved

#

hard to explain fir me

primal shadow Apr 20, 2024, 3:40 PM

#

Uhh so let's say you have 10 meshlets

#

And it gets culled down to 4

#

Can you give a practical example of what it would look like?

glass sphinx Apr 20, 2024, 3:58 PM

#

yea

#

id you have a mesh with 12 meshlets. It will first cull the mesh if it survivies it will write arguments to cull 12 mehslets

#

thats 1100 binary

#

so it will write the 3rd and 4th buffer

primal shadow Apr 20, 2024, 4:03 PM

#

what if you have 2 meshes with 12 meshlets?

glass sphinx Apr 20, 2024, 4:03 PM

#

then

primal shadow Apr 20, 2024, 4:03 PM

#

Do you need 32 buffers per mesh?

glass sphinx Apr 20, 2024, 4:03 PM

#

no

#

you append to them

primal shadow Apr 20, 2024, 4:04 PM

#

append how?

glass sphinx Apr 20, 2024, 4:05 PM

#

um

#

i ll make an example when im homr

#

sorry i cant rn did an oopsie in planning

primal shadow Apr 20, 2024, 4:06 PM

#

nah nw, I'll keep thinking on it. Thanks!

glass sphinx Apr 20, 2024, 7:24 PM

#

Ok

#

so an example would be

#

Surviving Meshes meshlet counts: 
mesh0:   6 meshlets              
mesh1:   3 meshlets              
mesh2:   1 meshlet               
mesh3:   5 meshlets         
mesh4:   7 meshlets

#

Each bit in the meshlet count maps to an arg buffer:

Surviving Meshes meshlet counts: | arg buffers appended:
mesh0:   6 meshlets              | 00000110
mesh1:   3 meshlets              | 00000011
mesh2:   1 meshlet               | 00000001
mesh3:   5 meshlets              | 00000101
mesh4:   7 meshlets              | 00000111

#

in this example we have 8 arg buffers

#

so the arg buffers will contain

#

arg buffer 0:
[mesh: mesh1, meshlet offset: 0] [mesh2, 0] [mesh3, 0]
arg buffer 1:
[mesh0, 0] [mesh1, offset: 1]
arg buffer 2:
[mesh0, 2] [mesh3, 1]

#

each argument is a pair of mesh index and meshlet offset into the mesh

#

so when for example looking at mesh0, you will append to arg buffer 1 and arg buffer 2

#

into 1 it will append mesh0, offset 0 and into arg buffer 2 it will append mesh0, offset 1

loud crag Apr 20, 2024, 7:33 PM

#

you using metal or why are you calling these argument buffers

glass sphinx Apr 20, 2024, 7:34 PM

#

arguments is the naming convention for some renderers for buffers that determine what threads in indirect dispatches map to

primal shadow Apr 20, 2024, 7:36 PM

#

So max of 8 meshlets per mesh, so 1 + ceil(log2(8)) = 4 argument buffers?

glass sphinx Apr 20, 2024, 7:36 PM

#

yes

#

in this example with 8 you could do 2^8-1 meshlets per mesh

primal shadow Apr 20, 2024, 7:37 PM

#

Then after culling, if there are N meshlets left for a mesh, append this mesh to the argument buffer X for all set bits X in N?

#

And ok, you append the mesh ID to each of those argument buffers

#

But what are the meshlet offsets?

glass sphinx Apr 20, 2024, 7:39 PM

#

primal shadow Then after culling, if there are N meshlets left for a mesh, append this mesh t...

you append in the mesh culling not meshlet culling

#

these arg buffers are then read by meshlet culling

#

all meshlets of a mesh go into arg buffers

primal shadow Apr 20, 2024, 7:39 PM

#

Right, I don't actually do mesh culling, I want to use this for meshlets culling -> triangles, but same concept so anyways

glass sphinx Apr 20, 2024, 7:40 PM

#

primal shadow But what are the meshlet offsets?

technically not needed

#

it tells the shader at what point to start working on the mehslet list

#

so for the example of mesh3:
first arg is in arg buffer 1: [mesh3, 0].
second arg is in arg buffer 2: [mesh3, 2]
in the meshlet cull, the threads read their dispatches power, so arg buffer 1 has power 2, arg buffer 2 has power 4.
they div their thread count by power and mod it by power.
the dived value is the arg index, the moded value is the offset inside the arg.

#

the first arg isin buffer 1

#

starting at 0

primal shadow Apr 20, 2024, 7:42 PM

#

glass sphinx the first arg isin buffer 1

Buffer 0 you mean?

#

What does starting at 0 mean though

#

I'm lost on what the offsets are for

#

Also, won't this ruin cache efficiency, if you're spreading the meshlets for a single mesh out across workgroups?

glass sphinx Apr 20, 2024, 7:43 PM

#

primal shadow What does starting at 0 mean though

starting at the 0th meshlet of mesh3

#

two threads will cull in arg buffer 1 dispatch

#

then in the arg buffer 2 dispatch, 4 threads will work on the rest of the meshlets

#

but they need to start at meshlet index 2

#

as the first arg buffer dispath alreadz worked on the first two meshlets

#

technically

#

you can also get that value by masking the mehslet count lower then the powers index

#

i will improve the example

primal shadow Apr 20, 2024, 7:46 PM

#

Please do, I am lost 😅

glass sphinx Apr 20, 2024, 7:46 PM

#

yea its hard to understand

#

a lot of bit twisting and indirections

#

i thing better examples

#

will make it clear

glass sphinx Apr 20, 2024, 10:00 PM

#

#

I dont have mroe time today

#

i will expand this next week

delicate rain Apr 20, 2024, 10:07 PM

#

"I don't have time" says bro while posting a detailed elaborate graph with colors and explanations

glass sphinx Apr 20, 2024, 10:16 PM

#

#

primal shadow Apr 20, 2024, 10:32 PM

#

ooooh ok this makes sense now!
Each item in an argbuffer represents 2^i units of work
hence the mesh_id of where it came from, and the meshlet_offset

So given the 13 meshlets in your example
it would be:
argbuf0: offset=0..=0
argbuf2: 1..=4
argbuf3: 5..=12
except you only need to store 0, 1, 5

ok two questions then:

How is this cache efficient, if you split the meshlets for a single mesh between many argbuffers?
Isin't this very lopsided work-wise? Each argbuffer has exponentially more args to process. Are you supposed to then take that mapping, and redistribute it somehow across the threads? Or is it fine
Also I need to go figure out how many threads you spawn to process each argbuffer, and how it'll work with multiple meshes

delicate rain Apr 20, 2024, 10:36 PM

#

Regarding the second point, I believe you'd just launch the appropriate number of threads for each buffer, so that each thread gets one

#

First question I don't understand, I doubt cache comes into play on a meshlet granularity

primal shadow Apr 20, 2024, 10:37 PM

#

meshlets need to lookup their mesh's transform and such

#

I also want to try and use this for meshlet->triangle writing

glass sphinx Apr 20, 2024, 10:40 PM

#

primal shadow ooooh ok this makes sense now! Each item in an argbuffer represents 2^i units of...

for small numbers its not cache efficient

#

it will also make the atomics much slower

#

as you need to do more of them

#

but for large numbers the bandwidth and divergence benefits win

#

i didnt test much but meshlet count over like 8-32 were better already with the po2 buffers

#

any if you have larger meshes its muuuhc better

#

I dont get point 2

#

how would it be lopsided

primal shadow Apr 20, 2024, 10:43 PM

#

Ok no I realized something

#

it's not 1 thread per arg

delicate rain Apr 20, 2024, 10:43 PM

#

The idea is to launch more threads for bigger arg buffers

primal shadow Apr 20, 2024, 10:43 PM

#

It's 1 thread per unit of work, of which there is 2^i units of work per arg

glass sphinx Apr 20, 2024, 10:43 PM

#

you launch (argcnt * 2^(arg buff index)) threads per arg buffer

#

so you waste nearly 0 threads

#

the po2 solution is by far the fastest i could find so far for this workload btw

#

what could be much better are workgraphs

#

they solve this on an api level

delicate rain Apr 20, 2024, 10:45 PM

#

Is cache locality that big of a deal on GPUs anyways btw? You have way more latency hiding compared to CPU, I don't really hear much about it

glass sphinx Apr 20, 2024, 10:45 PM

#

its extreamly important

#

much much more then on cpus

delicate rain Apr 20, 2024, 10:45 PM

#

Really, hmmm

glass sphinx Apr 20, 2024, 10:45 PM

#

and its getting worse each gen

#

as compute gets faste rmuch faster then cache

delicate rain Apr 20, 2024, 10:46 PM

#

Interesting yeh, I've not really been paying much attention to cache when writing shaders

#

Other than shared mem I guess

glass sphinx Apr 20, 2024, 10:46 PM

#

its ver intrecate

delicate rain Apr 20, 2024, 10:46 PM

#

There is so much to think about

glass sphinx Apr 20, 2024, 10:46 PM

#

it can be better to shit on cache locality if it means you safe a ton of mem bandidth like in this case

#

but you can do a lot of warp level optimizations btw @primal shadow that make the cache locality pretty good again fo the writes

primal shadow Apr 20, 2024, 10:47 PM

#

glass sphinx you launch (argcnt * 2^(arg buff index)) threads per arg buffer

Does each arg buffer have to be a seperate dispatch?

glass sphinx Apr 20, 2024, 10:47 PM

#

yes

primal shadow Apr 20, 2024, 10:47 PM

#

glass sphinx what could be much better are workgraphs

haha I'm not getting those for ages 😅

primal shadow Apr 20, 2024, 10:48 PM

#

glass sphinx yes

64 dispatches (64 triangles per meshlet) seems... expensive?

delicate rain Apr 20, 2024, 10:48 PM

#

That reminds me btw @wicked notch if you make your pages 128 your mark visible pages pass (if you scalarize) will be > double the speed

primal shadow Apr 20, 2024, 10:48 PM

#

especially because I have no way of overlapping them...

glass sphinx Apr 20, 2024, 10:48 PM

#

64 dispatches are nothing

#

they will auto overlap

primal shadow Apr 20, 2024, 10:48 PM

#

no wgpu will insert barriers 😭

glass sphinx Apr 20, 2024, 10:48 PM

#

i really hope wgpu is not that crippled

wicked notch Apr 20, 2024, 10:48 PM

#

delicate rain That reminds me btw <@320895822394818561> if you make your pages 128 your mark v...

it's already blazing fast for me

glass sphinx Apr 20, 2024, 10:49 PM

#

bleakekw

delicate rain Apr 20, 2024, 10:49 PM

#

wicked notch it's already blazing fast for me

Hmm really

#

How fast?

primal shadow Apr 20, 2024, 10:49 PM

#

each dispatch is treated as a "usage scope", and will ocmplete in order via barriers between them :/

delicate rain Apr 20, 2024, 10:49 PM

#

Was like 1.2 ms without scalarization for me

wicked notch Apr 20, 2024, 10:49 PM

#

delicate rain How fast?

about 34 microseconds

delicate rain Apr 20, 2024, 10:50 PM

#

Wtf

glass sphinx Apr 20, 2024, 10:50 PM

#

primal shadow 64 dispatches (64 triangles per meshlet) seems... expensive?

its also a dispatch per arg buffer so 24 or 32

#

not 64

delicate rain Apr 20, 2024, 10:50 PM

#

How

wicked notch Apr 20, 2024, 10:50 PM

#

I really do nothing special there

glass sphinx Apr 20, 2024, 10:50 PM

#

show code

delicate rain Apr 20, 2024, 10:50 PM

#

Do you do atomics or?

wicked notch Apr 20, 2024, 10:50 PM

#

https://github.com/LVSTRI/Retina/blob/master/src/Retina/Sandbox/Shaders/VirtualShadowMarkActivePages.comp.glsl

primal shadow Apr 20, 2024, 10:50 PM

#

glass sphinx its also a dispatch per arg buffer so 24 or 32

64 triangles per meshlets = 64 arg buffers, no?

wicked notch Apr 20, 2024, 10:50 PM

#

I do no atomics

glass sphinx Apr 20, 2024, 10:50 PM

#

primal shadow 64 dispatches (64 triangles per meshlet) seems... expensive?

oh if you can only go up to 64 you only need 6 dispatches

glass sphinx Apr 20, 2024, 10:51 PM

#

primal shadow 64 triangles per meshlets = 64 arg buffers, no?

no

delicate rain Apr 20, 2024, 10:51 PM

#

Ah no caching

#

Figures

glass sphinx Apr 20, 2024, 10:51 PM

#

the arg buffers double the work each time

delicate rain Apr 20, 2024, 10:51 PM

#

Nevermind then

primal shadow Apr 20, 2024, 10:51 PM

#

log2(64) = 6*

wicked notch Apr 20, 2024, 10:51 PM

#

ah yeah with caching it would be a problem bleakekw

wicked notch Apr 20, 2024, 10:51 PM

#

primal shadow log2(64) = 6*

6

glass sphinx Apr 20, 2024, 10:51 PM

#

primal shadow log2(64) = 6*

6

primal shadow Apr 20, 2024, 10:51 PM

#

shhh

glass sphinx Apr 20, 2024, 10:51 PM

#

keko

frank sail Apr 20, 2024, 10:51 PM

#

6

primal shadow Apr 20, 2024, 10:51 PM

#

ok maybe 6 is fine barrier wise then

delicate rain Apr 20, 2024, 10:51 PM

#

wicked notch ah yeah with caching it would be a problem <:bleakekw:1082598350303539240>

You need atomics and perf dies ☹️

glass sphinx Apr 20, 2024, 10:52 PM

#

atomics scalarization makes it warp speed again

delicate rain Apr 20, 2024, 10:54 PM

#

Young man does image store with no atomics with multiple threads writing the same spot

#

Is that legal?

wicked notch Apr 20, 2024, 10:54 PM

#

it's ub that just works

#

one could do imageAtomicStore

delicate rain Apr 20, 2024, 10:55 PM

#

And watch your perf plummet

wicked notch Apr 20, 2024, 10:55 PM

#

yes kekkedsadge

delicate rain Apr 20, 2024, 10:55 PM

#

Btw how do you allocate? You do it after?

wicked notch Apr 20, 2024, 10:55 PM

#

ye

delicate rain Apr 20, 2024, 10:55 PM

#

Loop through VSM and find all visible this frame and allocate

glass sphinx Apr 20, 2024, 10:56 PM

#

wicked notch it's ub that just works

froge i will get the hammer

wicked notch Apr 20, 2024, 10:56 PM

#

ok I will fix

glass sphinx Apr 20, 2024, 10:56 PM

#

delicate rain Young man does image store with no atomics with multiple threads writing the sam...

how does that make sense

#

is it jsut marking it?

#

like or iing a 1 on?

delicate rain Apr 20, 2024, 10:57 PM

#

They all write the same ye

wicked notch Apr 20, 2024, 10:57 PM

#

yes

glass sphinx Apr 20, 2024, 10:57 PM

#

then atomic store might actually just be the same speed

#

on nv hw

#

could be that that stores of that size are inherently atomic

#

relaxed atomic writes on x86 are also completely free

delicate rain Apr 20, 2024, 10:58 PM

#

Sad that I need to or

#

But I don't want to make another imaž just for this bitmask

glass sphinx Apr 20, 2024, 10:58 PM

#

yea

wicked notch Apr 20, 2024, 10:59 PM

#

it's a 2KiB image it's fiine

delicate rain Apr 20, 2024, 10:59 PM

#

Another fetch is what makes me sad

#

Not the memory itself

wicked notch Apr 20, 2024, 10:59 PM

#

the whole image stays in L2 cache

#

it's fiiine

glass sphinx Apr 20, 2024, 10:59 PM

#

the tap is shit but if its so small its probably ok

delicate rain Apr 20, 2024, 10:59 PM

#

You make convincing point

#

Alas it's not even near to being the bottleneck

glass sphinx Apr 20, 2024, 11:00 PM

#

also the scalarization would still help btw

delicate rain Apr 20, 2024, 11:00 PM

#

As Jaker would say, it doesn't matter for now

glass sphinx Apr 20, 2024, 11:00 PM

#

less contention still

#

real

#

its fine like so

#

tido crashes on amd shader compiler crashes

#

goofy aah drivers

delicate rain Apr 20, 2024, 11:01 PM

#

I just noticed the pass spiking to 130us when I reduced the page size to 64

glass sphinx Apr 20, 2024, 11:01 PM

#

why 64

delicate rain Apr 20, 2024, 11:01 PM

#

Compared to 30us for 128 page size

#

More pages == less scalarization

delicate rain Apr 20, 2024, 11:01 PM

#

glass sphinx why 64

Was just experimenting with values

glass sphinx Apr 20, 2024, 11:01 PM

#

yes why do you have more pages now

#

is it important

#

ah

#

okk

delicate rain Apr 20, 2024, 11:02 PM

#

You theoretically get better culling

glass sphinx Apr 20, 2024, 11:02 PM

#

eh

delicate rain Apr 20, 2024, 11:02 PM

#

And tighter allocation

wicked notch Apr 20, 2024, 11:02 PM

#

I also tried 64 but for a different reason

#

I wanted to see whether smaller pages could help with projective aliasing

delicate rain Apr 20, 2024, 11:02 PM

#

wicked notch I also tried 64 but for a different reason

You have 64 now

wicked notch Apr 20, 2024, 11:02 PM

#

(they don't)

delicate rain Apr 20, 2024, 11:02 PM

#

It doesn't matter what size the page is

#

For the resulting shadow quality

#

The information density is the same

wicked notch Apr 20, 2024, 11:03 PM

#

ye I realized that later KEKW

delicate rain Apr 20, 2024, 11:03 PM

#

You get tighter allocation and culling

wicked notch Apr 20, 2024, 11:03 PM

#

btw

#

did you figure out how to reduce the VSM

#

because there's a resolution difference between the VSM itself and the virtual resolution

delicate rain Apr 20, 2024, 11:03 PM

#

Was?

#

I don't follow

#

Reduce for hiz you mean?

wicked notch Apr 20, 2024, 11:04 PM

#

ye

delicate rain Apr 20, 2024, 11:06 PM

#

I mean you just make MIPS of the memory block and reduce each page there (the mapping is preserved 64 wide page become 32 wide in MIP 1) and once you reach max MIP (each page is 1 Texel) you use that to build the virtual HIZ on top (starting at page table resolution and going again Down to a single texel)

#

That is my current working plan for implementing that

wicked notch Apr 20, 2024, 11:07 PM

#

hmm I see

delicate rain Apr 20, 2024, 11:08 PM

#

but these are future plans

#

I am stuck on fixing tido bugs and dynamic object shadows

#

I wanted to generate the invalidated pages mask on the GPU but the geo pipeline we have in Tido is hellish (and I still don't really understand it) so I'll prob cope with CPU approach

primal shadow Apr 20, 2024, 11:41 PM

#

So uhh, if I just remove the write_index_buffer pass entirely, and go back to doing it from the culling shader itself, its much faster...

if meshlet_visible {
    let meshlet = meshlets[meshlet_id];
    let start = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u) / 3u;
    let base = cluster_id << 8u;
    for (var triangle_id = 0u; triangle_id < 64u; triangle_id++) {
        if triangle_id >= meshlet.triangle_count { return; }
        draw_triangle_buffer[start + triangle_id] = base | triangle_id;
    }
}

glass sphinx Apr 20, 2024, 11:47 PM

#

probably cause nearly all threads were dead in the write index buffer pass

#

for writing the indices i think it might be better to just loop btw

#

cause the numbers might be too low and similar to get any gains from po2 arg buffers

primal shadow Apr 20, 2024, 11:50 PM

#

glass sphinx cause the numbers might be too low and similar to get any gains from po2 arg buf...

yeah I think it would be problematic

primal shadow Apr 21, 2024, 12:44 AM

#

#

I've gotten it down to a very very reasonable amount of time 🙂 (minus the data uploads, stupid data uploads)

primal shadow Apr 21, 2024, 2:44 AM

#

Ok, idea

#

We already have a list of instances

#

Run a compute shader to write out the per-cluster data directly

#

I could also slap on per instance culling in the same shader

glass sphinx Apr 21, 2024, 10:24 AM

#

you should probably look at othe rpeojects for this as well

#

most of this has been done many times already

primal shadow Apr 24, 2024, 6:57 AM

#

Occlusion culling was broken (accidently bound 1 mip, instead of every mip), fixed it. Still some overdraw, but that's always been the case. I need to tweak the occlusion culling some more.

primal shadow Apr 25, 2024, 4:57 PM

#

glass sphinx you should probably look at othe rpeojects for this as well

I think it's time for persistent threads for culling tbh.

#

Should save a lot of culling time, and also solve my data upload problem. I won't need to upload 8 byters per instance of a meshlet every frame. Instead, I can just upload 1 u32 per mesh instance pointing to the BVH root of that mesh's meshlet tree.

glass sphinx Apr 25, 2024, 5:51 PM

#

you dont need persistent threads for that

#

you'll have to make a lockless queue and make a table of device to threadcount for it

#

very shaky

primal shadow Apr 25, 2024, 5:52 PM

#

Well, the alternative is a bunch of dependent dispatches to traverse the tree...

primal shadow Apr 25, 2024, 5:52 PM

#

glass sphinx very shaky

yeah, but perf

primal shadow Apr 25, 2024, 8:33 PM

#

So my test scene has 12_463_400 clusters

#

I upload 8 bytes per cluster each frame

#

PCIE x16 Gen3 bandwidth is 16 GB/s

#

12_463_400 * 8 ~= 99.7 MB

#

At 16 MB/ms, that's ~6.23ms of upload time

#

NSight shows 92% throughput, and overall ~8.4ms of upload time

#

So yeah, far too expensive

glass sphinx Apr 25, 2024, 8:35 PM

#

yea thats horrible

primal shadow Apr 25, 2024, 8:35 PM

#

I need to reduce this down to per-instance data upload

glass sphinx Apr 25, 2024, 8:35 PM

#

you can also make it completely gpu driven

primal shadow Apr 25, 2024, 8:36 PM

#

The power2 dispatch trick would work well, and then write out per-cluster data from a compute shader, but I would have to put a barrier between every compute dispatch because of wgpu (unless I hack around it?)

#

Alternatives include making the clusters form a tree of LODs, and writing out only the LOD tree ID per instance, and then making the shader traverse the tree instead of a flat list

#

Or uploading a per-instance prefix sum of meshlet counts per instance, and having all 3 shaders (cull, raster, shading from visbuffer) do a binary search to map thread ID -> instance ID :/

primal shadow Apr 25, 2024, 8:37 PM

#

glass sphinx you can also make it completely gpu driven

How so?

primal shadow Apr 25, 2024, 8:58 PM

#

Oh wait, there's another option here

#

Upload the prefix sum yes, but then have a pass at the start perform the binary search and output flat buffers

glass sphinx Apr 25, 2024, 9:05 PM

#

do zou do the lod traversal and nanite thing on cpu_

#

?

#

i mean you can try persistent threads

#

but i think that will be pain

primal shadow Apr 25, 2024, 9:06 PM

#

glass sphinx do zou do the lod traversal and nanite thing on cpu_

No, it's on the GPU, over a flat list of meshlets (you don't have to use a tree)

primal shadow Apr 25, 2024, 9:39 PM

#

On the CPU, compute a prefix sum buffer of meshlet counts per instance, and indices to the start of the meshlet buffer for that instance
On the GPU, run a new pass once at the start of the frame, 1 thread per instance of a meshlet (same as culling). Each thread (cluster) does a binary search on the prefix sum meshlet count list to find their instance ID, plus meshlet asset ID, and then writes it to the buffers for subsequent passes.

primal shadow Apr 26, 2024, 1:27 AM

#

New buffers:
    meshlet_mesh_slices = [2 u32's per meshlet_mesh asset containing buffer slice]
    instance_meshlet_meshes = [map of instance -> meshlet_mesh_slices id]
    meshlet_count_prefix_sum = [scratch buffer of size instance_count]

Shader 1: <perform prefix sum of 1 thread per instance to fill out meshlet_count_prefix_sum>
Shader 2: <follow below of 1 thread per cluster to fill out thread_instance_ids and thread_meshlet_ids>

let cluster_id = global_invocation_id.x;
let instance_id = <binary search in meshlet_count_prefix_sums using cluster_id>;
let meshlet_mesh_id = instance_meshlet_meshes[instance_id];
let (meshlet_start, meshlet_end) = meshlet_mesh_id[meshlet_mesh_id];
let meshlet_id = meshlet_count_prefix_sums[instance_id - 1] - cluster_id; // If instance_id = 0, meshlet_id = cluster_id

meshlet_thread_instance_ids[cluster_id] = cluster_id;
meshlet_thread_meshlet_ids[cluster_id] = meshlet_id;

primal shadow Apr 26, 2024, 1:55 AM

#

@wicked notch do your clusters (LOD 0 or otherwise) reference a single set of vertices, or do you have a new set of vertices per cluster?

wicked notch Apr 26, 2024, 1:56 AM

#

per group but yes

#

idk if I want to keep that, but it's nice and shrimple

primal shadow Apr 26, 2024, 1:57 AM

#

hrmm, I have a single set of vertices, and just reference them per cluster

#

everyone seems to do the opposite though

honest socket Apr 26, 2024, 2:57 AM

#

having a single set of vertices is gonna eventually not be compatible with streaming when we add that

primal shadow Apr 26, 2024, 3:12 AM

#

mhm

glass sphinx Apr 26, 2024, 10:09 AM

#

ive never heard of that

#

every impl i saw is referencing a single set of verts

#

aside from those that quantize

primal shadow Apr 27, 2024, 5:43 AM

#

Saved about 32ms of frame time in my nanite impl 😄

#

~8.4ms GPU, ~17.5-27.8ms CPU

#

No more uploading per-cluster data

#

170 FPS

wide shadow Apr 27, 2024, 6:20 AM

#

This is like nanite outpost. Maybe this info will be useful.
Guys at epic games have been really busy making nanite better. Now you can have tessellation, displacement and skeletal animation with nanite.
They got rid of material buffer(abusing depth buffer with OP EQUAL) and now they shade in compute shader. They detect which derivatives(by looking at the dissembly of the vendor blob) are used and then select which pipeline is used by doing this they reduced amount of helper invocations by 6-10%. They now use VRS which 27-45% reduction in shading. Also something to make lpotrick little jealous is that now they use work graphs.
There is nothing(only git commits) on how they did tesselation, displacement a skeletal animations yet.

#

https://www.unrealengine.com/en-US/blog/take-a-deep-dive-into-nanite-gpu-driven-materials

primal shadow Apr 27, 2024, 6:23 AM

#

Yeah I read their presentation (it's very long). There's also (not unreal) http://filmicworlds.com/blog/software-vrs-with-visibility-buffer-rendering/

#

VRS/compute shading is on my list todo, but not very high priority atm

#

workgraphs (and mesh shaders) I don't have access to sadly

#

the other stuff there are papers on, although idk how nanite specifically implements it (I don't read their code)

#

I'm still a good bit off of nanite 1.0 levels of perf, although I'm making a lot of progress

#

There's also VSMs, which I haven't implemented at all

#

Per-view nanite is very expensive... and I want shadow views...

wispy spear Apr 28, 2024, 10:30 PM

#

lustri, im really fond of the new auto thingy() -> shizzle {} way of conjuring c++

#

thanks for ze inzpiration and being a part of my life ❤️ 🙂

primal shadow Apr 28, 2024, 10:33 PM

#

wispy spear lustri, im really fond of the new `auto thingy() -> shizzle {}` way of conjuring...

Looks like Rust 🤔

wispy spear Apr 28, 2024, 10:33 PM

#

it do be

finite yacht Apr 28, 2024, 10:35 PM

#

i am curious, why do you prefer it over shizzle thingy() {}

wispy spear Apr 28, 2024, 10:35 PM

#

: )

finite yacht Apr 28, 2024, 10:37 PM

#

great argument

wispy spear Apr 28, 2024, 10:38 PM

#

that way all declarations are neatly aligned

#

virtual auto Initialize() -> bool;
virtual auto Load() -> bool;
virtual auto Unload() -> void;

virtual auto OnMouseMoved(double mouseX, double mouseY) -> void;
virtual auto OnMouseEntered() -> void;
virtual auto OnMouseLeft() -> void;
virtual auto OnFramebufferResized(int32_t width, int32_t height) -> void;
virtual auto OnFilesDropped(const std::vector<std::string>& fileNames) -> void;

finite yacht Apr 28, 2024, 10:39 PM

#

thats nice

loud crag Apr 28, 2024, 10:40 PM

#

wispy spear ```cpp virtual auto Initialize() -> bool; virtual auto Load() -> bool; virtual a...

tbh yeah, but I don't use trailing return type when the return type is void, bool, or any other four character type

#

and generally tbh I don't like trailing return types unless it's deductable

wispy spear Apr 28, 2024, 10:41 PM

#

i dont know what that means, or rather how it would look like

#

leaving stuff out feels weird

loud crag Apr 28, 2024, 10:42 PM

#

just as an example from my math thing im writing right now:

        constexpr auto operator*(vec<T, M> other) const noexcept {
            vec<T, M> ret(T(0));
            for (std::size_t i = 0; i < M; ++i)
                ret += col(i) * other[i];
            return ret;
        }

#

it just knows that the return type is vec<T, M>, and then just auto is enough

wispy spear Apr 28, 2024, 10:42 PM

#

ah

loud crag Apr 28, 2024, 10:42 PM

#

same happens with lambdas, generally

wispy spear Apr 28, 2024, 10:42 PM

#

thats too next level for me still

#

although, in c# i do the same, when it comes to lambdas

#

and in c++ too i think, when it comes to lambdas

wispy spear Apr 28, 2024, 11:39 PM

#

eh the other thing is prefixing types

#

EMyEnum, SMyStruct, im not cool with CMyClass yet 🙂 too deep wounds from the 90s, but its a logical thing to do

glass sphinx Apr 28, 2024, 11:53 PM

#

primal shadow VRS/compute shading is on my list todo, but not very high priority atm

good vrs os extremely invasive in the whole renderer

#

its not something that can be added easily

glass sphinx Apr 28, 2024, 11:54 PM

#

primal shadow ~8.4ms GPU, ~17.5-27.8ms CPU

wait why is the cou time so high

#

shouldnt it be totally gpu bound if is nanite like?

primal shadow Apr 28, 2024, 11:55 PM

#

glass sphinx wait why is the cou time so high

It used to be, because I was uploading 8 bytes per cluster for like 128 million clusters. It's very little CPU load now.

glass sphinx Apr 28, 2024, 11:56 PM

#

ah thats the savings number

primal shadow Apr 28, 2024, 11:56 PM

#

Yes

wheat haven Apr 29, 2024, 5:57 PM

#

went to the retina repo for some inspiration and noticed you managed to get an all-numbers commit hash (at least the short part github shows)

severe dome Apr 29, 2024, 10:27 PM

#

wispy spear lustri, im really fond of the new `auto thingy() -> shizzle {}` way of conjuring...

i also prefer this style :^)

#

it becomes significantly more beneficial when you start doing things with templates

#

any kind of long name in a return type, or any kind of sfinae or type trait nonsense is completely unreadable when not using trailing return types

#

trailing return types for member functions have access to the namespace of thr class they're in, which is great if you're using aliases a lot

#

Foo::Bar<Foo::Baz> Foo::buh(); will always be inferior to auto Foo::buh() -> Bar<Baz>;

#

just better in every way

wheat haven Apr 29, 2024, 10:32 PM

#

you're saying that normal return types can't? I do that all the time

severe dome Apr 29, 2024, 10:33 PM

#

not if you define them outside the class body, no

wheat haven Apr 29, 2024, 10:33 PM

#

ah that, yeah

#

you gave declarations as examples so I was confused

severe dome Apr 29, 2024, 10:34 PM

#

I'm on mobile so didn't want to flesh them out :)

wispy spear Apr 29, 2024, 10:53 PM

#

: )

#

monsieur lustri also added rust like typisms ala "self" too, but im not there yet

severe dome Apr 30, 2024, 9:50 AM

#

can't use those bc I'm stuck on C++17 :(

primal shadow May 1, 2024, 5:31 AM

#

Ok since I'm going to have to do it 4+ times/view I decided to allocate buffers to hold instance+meshlet ID per cluster in the scene, instead of doing the binary search in every pass

#

Unfortnatently that means another 8 bytes/cluster of memory usage :((

#

100 million clusters = 800 mb 😦

#

In the scene, not rendered

#

And LOD clusters count towards that

#

~6.7ms/frame, but ~2ms of that is CPU time 😦

primal shadow May 3, 2024, 3:09 AM

#

I think if I want to save 8bytes/scene_cluster, I'll need to switch to persistent culling

#

Also save some culling time, probably like ~0.3ms

#

Culling is already extremely fast for me though

glass sphinx May 3, 2024, 10:05 PM

#

primal shadow 170 FPS

how many bunnies is that

primal shadow May 3, 2024, 10:08 PM

#

glass sphinx how many bunnies is that

~3100

primal shadow May 4, 2024, 6:55 PM

#

So I've been coming back to this. Yes, clipping the triangle vertices is wrong. But why not clip the triangle AABB you use for testing pixels covered by the triangle?

primal shadow May 4, 2024, 7:17 PM

#

Wait, I think Nanite only forces HW rasterization for depth clipping, and not for viewport

#

that makes more sense

wicked notch May 5, 2024, 11:31 AM

#

something's cooking lads

wispy spear May 5, 2024, 11:35 AM

#

pasta water?

wicked notch May 5, 2024, 11:36 AM

#

I cooked that earlier KEKW

#

something else's cooking

wispy spear May 5, 2024, 12:30 PM

#

Nanite 2.0's oven timer just went off

wispy spear May 9, 2024, 10:24 AM

#

lustri my frosch, you are using cmake's pchisms

#

but im a little bamboozled by it

#

you create a pch library so to speak, and in it you have all the stl headers you use across the retina

#

yet you include the pch nowhere, but whatever stl header individually again

#

does "linking with the pch lib" take care of the usual "include <pch.h>"?

frank sail May 9, 2024, 10:25 AM

#

I think the pch is automatically included by everything in the cmake way

wicked notch May 9, 2024, 10:26 AM

#

wispy spear does "linking with the pch lib" take care of the usual "include <pch.h>"?

yes

wispy spear May 9, 2024, 10:26 AM

#

that means you could remove the individual stl includes

#

neh?

frank sail May 9, 2024, 10:26 AM

#

I think it means those includes are now free

wispy spear May 9, 2024, 10:26 AM

#

at least that was my understanding how cmakes pchisms should work

wicked notch May 9, 2024, 10:26 AM

#

they're redundant yes

wispy spear May 9, 2024, 10:26 AM

#

ah

frank sail May 9, 2024, 10:26 AM

#

I'd keep them, imo it makes things more readable

wispy spear May 9, 2024, 10:26 AM

#

: )

#

that is/was the conchfusing conchfusion

#

thank you my froggies

#

gcc 14.1.1 is out btw

faint crane May 15, 2024, 8:48 PM

#

https://github.com/AIFanatic/three-nanite

GitHub

GitHub - AIFanatic/three-nanite: An attempt at reproducing unreal n...

An attempt at reproducing unreal nanite in threejs - AIFanatic/three-nanite

#

Someone beat me to it.

finite yacht May 18, 2024, 11:02 PM

#

When will your meshoptimizer fork support IDK_BC5_normal_metallicRoughness 🐸

wispy spear May 21, 2024, 11:29 AM

#

@wicked notch are you alright

wicked notch May 21, 2024, 11:30 AM

#

it depends

wispy spear May 21, 2024, 11:30 AM

#

i heard italy got hit by earfquakes, around napoli

wicked notch May 21, 2024, 11:30 AM

#

I don't live there KEKW

#

I haven't heard the news

wispy spear May 21, 2024, 11:30 AM

#

ah

#

was worried about your pizza oven 😄

frank sail May 21, 2024, 11:30 AM

#

reminds me of the earthquake that hit where I live, and I didn't feel shit

wicked notch May 21, 2024, 11:30 AM

#

we can make pizza outside of napoli you know

wispy spear May 21, 2024, 11:31 AM

#

😛

#

also last night jaker?

frank sail May 21, 2024, 11:31 AM

#

nah it was months ago

#

now I'm looking for a meme

wispy spear May 21, 2024, 11:32 AM

#

ah

#

: )

frank sail May 21, 2024, 11:33 AM

#

found it

primal shadow May 23, 2024, 3:28 AM

#

Automatic GLTF scene -> meshlet converter

#

https://cdn.discordapp.com/attachments/743663924229963868/1243045821302313102/image.png?ex=66500c22&is=664ebaa2&hm=5b5203c7abdab9a8491dfb6d767e7eb68279e8b87a267554596cd7e0b44ca00f&

#

https://cdn.discordapp.com/attachments/743663924229963868/1243045927678382180/image.png?ex=66500c3b&is=664ebabb&hm=7035630482e42945387f014936d880d5826229fb36b90973c1b830eb6bd29613&

loud crag May 23, 2024, 4:41 AM

#

@primal shadow what gltf extension do you mean exactly?

primal shadow May 23, 2024, 4:53 AM

#

loud crag <@145540119141679105> what gltf extension do you mean exactly?

I made my own GLTF extension so I could store meshlet data in it and integrate the new type of geoemtry with our existing scene and asset system

loud crag May 23, 2024, 4:58 AM

#

ah you said „bevy meshlet extension“ in showcase and that didnt sound like it was just custom for your engine

primal shadow May 23, 2024, 5:01 AM

#

Ah no it's a custom extension, bevy is the name of the engine I work on

faint crane May 23, 2024, 5:31 AM

#

Awesome work. Starting on this and my goal was to get Bistro itself rendering.

#

This is what I'm rolling with. Doesn't eat up 5GB of VRAM so it's a start,.

📎 bistro.glb

#

I read https://blog.traverseresearch.nl/creating-a-directed-acyclic-graph-from-a-mesh-1329e57286e5 but am following along https://jglrxavpok.github.io for comparison. They do merging or cleanup of vertices at some point, but still METIS + Meshoptimizer.

primal shadow May 23, 2024, 6:00 AM

#

If you want another point of comparison, my code is at https://github.com/bevyengine/bevy/blob/main/crates/bevy_pbr/src/meshlet/from_mesh.rs. I haven't implemented vertex welding/merging yet.

#

I'm also working on writing up a (very long) blog post detailing how everything works in my meshlet renderer so far, that'll come out sometime in the next month probably

faint crane May 23, 2024, 6:04 AM

#

Nice. Looked over the earlier PRs. Yours is definitely easiest to follow if nothing but a reminder that my occlusion culling doesn't fully work. brianna_pls

#

Seems like an unending problem.

primal shadow May 23, 2024, 6:05 AM

#

Neither does mine, it seems to over-cull on thin planes 😦

#

The screenshots of bistro I took were carefully positioned to avoid showing the bug 😅

faint crane May 23, 2024, 6:07 AM

#

I'll figure something out. Using WebGPU via Chrome so a combination of JS and Rust/WASM.

#

Will at least create new bugs.

primal shadow May 23, 2024, 6:11 AM

#

WebGPU shouldn't really limit you for this. No min sampler mode, but it's not neccesary anyways.

#

It's just a lot of working implementing everything and maing it fast 😅

#

I still need to redo a lot of things

faint crane May 23, 2024, 6:15 AM

#

It's just the browser really. Need to see if any GPU debuggers work.

#

Plan to shoehorn it into WebGL where I'm really limited.

wicked notch May 23, 2024, 11:08 AM

#

primal shadow If you want another point of comparison, my code is at https://github.com/bevyen...

do you give up after you can't simplify anymore without trying to reach one node at the root?

primal shadow May 23, 2024, 3:54 PM

#

wicked notch do you give up after you can't simplify anymore without trying to reach one node...

Yep

primal shadow May 24, 2024, 12:41 AM

#

wicked notch do you give up after you can't simplify anymore without trying to reach one node...

Why do you ask?

wicked notch May 24, 2024, 12:45 AM

#

because vertex welding is making me sad

primal shadow May 24, 2024, 12:54 AM

#

Ah. Why so?

wheat haven May 24, 2024, 1:28 AM

#

What kind of welding do you mean? Just de-duplicating? Or merging nearby vertices?

wicked notch May 24, 2024, 1:37 AM

#

primal shadow Ah. Why so?

uvs

wicked notch May 24, 2024, 1:38 AM

#

wheat haven What kind of welding do you mean? Just de-duplicating? Or merging nearby vertice...

merging nearby verts

primal shadow May 24, 2024, 1:55 AM

#

Yeah UV seams make things really difficult

primal shadow May 28, 2024, 6:36 AM

#

I'm starting on my blog post to accompany the Bevy 0.14 release notes for the new (experimental) meshlet feature. Maybe it's just my writing style, or the amount of info I have to cover and how much I hate writing, but it feels very dry :(. I'm hoping it's at least informative? Idk, only half done, and the asset processing stuff is the boring part and the section I don't have much to talk about, relative to the (not yet written) shader section, where I did a bunch of things differently from Nanite. https://github.com/JMS55/jms55.github.io/blob/a3c272a4563675c6888cfce6565784af1c552b0c/content/posts/2024_05_10_virtual_geometry_bevy_0_14/index.md.

dull oyster May 28, 2024, 6:52 PM

#

I would guess that "dryness" will varies per reader. Some people may prefer a straight-to-the-point article, and it might be easier to read for non native speakers.

Taking myself as an example, I try to write things as I would say them, but I tend to prefer focused articles if I want to learn a technique*. Furthermore, english is not my native language but I found your introduction easy to read and understand, and the other parts are straight-to-the-point (that does not mean they are hard to read nor understand!), as you write it in the article "in favor of just describing the algorithm itself". That's a complete valid reason, you already have a lot to write about!

runic surge May 28, 2024, 7:15 PM

#

primal shadow I'm starting on my blog post to accompany the Bevy 0.14 release notes for the ne...

Maybe add images to visualize various concepts?

#

Images are great at getting the point across and lets you re-reference them later if you need them

primal shadow May 28, 2024, 8:55 PM

#

dull oyster I would guess that "dryness" will varies per reader. Some people may prefer a st...

Thanks for the feedback. I'm glad it was understandable, I was worried just giving a brief description of things was not enough, but there's so much to write about 😅. Btw your English is great!

primal shadow May 28, 2024, 8:55 PM

#

runic surge Images are great at getting the point across and lets you re-reference them late...

Yeah I have a few placeholder TODOs for images, maybe more would be helpful

dull oyster May 28, 2024, 9:03 PM

#

If the description was too brief, you can always make a follow up article or updates to your article

primal shadow May 28, 2024, 9:08 PM

#

True

frank sail May 29, 2024, 8:28 PM

#

primal shadow I'm starting on my blog post to accompany the Bevy 0.14 release notes for the ne...

I liked the write-up. I found it quite understandable and didn't mind the "dryness" at all

#

Now I just gotta wait for the to-do sections hehe

primal shadow May 29, 2024, 8:39 PM

#

Thanks for the feedback! Working on it in parallel with actually writing new code 😅

wispy spear May 29, 2024, 9:13 PM

#

@wicked notch John Khronos is reaching out for help https://discord.com/channels/427551838099996672/1151556776379023452

wicked notch May 29, 2024, 9:14 PM

#

I literally read "John Khronos as a member"

#

my brain is truly fried

wispy spear May 29, 2024, 9:19 PM

#

hehe

#

man they create new shit every week

#

now new PBR tonemapper

#

3d commerce bs

#

anari nonsense

#

whatever that kamaros thing is

finite quartz May 29, 2024, 9:52 PM

#

wicked notch I literally read "John Khronos as a member"

Same, which made me think this was a wikipedia article about John Kronos life achievement...

primal shadow Jun 1, 2024, 12:57 AM

#

Terminology and 1 pass of my renderer written about for the blog post - 8 more passes and the future work section left 😅

dull oyster Jun 1, 2024, 10:34 AM

#

primal shadow Terminology and 1 pass of my renderer written about for the blog post - 8 more p...

About the first pass (I don't know much about Bevy's architecture): do you really have to compute the instance <-> meshlet connection every frame?
Can't you "just" compute this mapping once at load and then reuse it?

primal shadow Jun 1, 2024, 6:24 PM

#

dull oyster About the first pass (I don't know much about Bevy's architecture): do you reall...

At load technically yes, but I want to support dynamically adding and removing entities.

#

Unreal's solution is better though, I'm going to copy that. It's going to be part of the future work section.

#

Instead of dispatching per cluster, build a BVH over the lod tree, and traverse the tree per instance instead with persistent culling.

dull oyster Jun 1, 2024, 6:28 PM

#

primal shadow At load technically yes, but I want to support dynamically adding and removing e...

You could compute the mapping for the entity when it gets added, right?

primal shadow Jun 1, 2024, 6:32 PM

#

dull oyster You could compute the mapping for the entity when it gets added, right?

Removing entities makes it more difficult. Then you have gaps in the buffer and can no longer map cluster id to instance id just by indexing.

dull oyster Jun 1, 2024, 6:38 PM

#

Oh right

primal shadow Jun 4, 2024, 6:41 AM

#

Day ? of ??? of writing. Did half a shader, didn't even manage a full pass 😅. Tbf cluster culling is a lot.

primal shadow Jun 8, 2024, 10:26 PM

#

@wicked notch @dull oyster I've finished writing the first draft of my frame breakdown. Not published to my blog yet, but here's an early access link: https://github.com/JMS55/jms55.github.io/blob/meshlet-wip/content/posts/2024_06_08_virtual_geometry_bevy_0_14/index.md

I'd very much appreciate if you could give it a review! Criticism very much welcome. It's hard to know what other people will intuitively understand or what needs more explanation when writing in general, and this is an extremely technical topic that I tried to avoid spending 10 years writing 😅 .

wicked notch Jun 8, 2024, 10:46 PM

#

I'll give it a read later tonite

velvet marsh Jun 9, 2024, 12:20 AM

#

that's pretty incredible, thanks for writing that, that's immensely valuable to learn how you implemented virtual geometry and some of these concepts, I hadn't heard of a depth pyramid before or virtual geometry

dull oyster Jun 9, 2024, 10:18 AM

#

primal shadow <@320895822394818561> <@163242187084005377> I've finished writing the first draf...

First off, congrats on writing about the entire pipeline!

Overall I think the article is very interesting.

I was on mobile so I could not check the code easily but here is my feedback:

"In the example with the wall with the rocks and trees behind it, we could see that last frame the wall clusters contributed pixels to the final image, but none of the rock or tree clusters did. “
Maybe I missed something but I could not see this example?
I like the discussion of why a single indirect draw was chosen
Writing different vertex indices for the material pass is smart

wispy spear Jun 9, 2024, 10:22 AM

#

something which could/should be presented at GPC in november 😉

#

@dull oyster + @wicked notch + @delicate rain https://www.graphicsprogrammingconference.nl/?v=4 dunno if @primal shadow is from europe too hehe

Graphics Programming Conference

Breda University of Applied Sciences

Graphics Programming Conference

Graphics Programming Conference (GPC) is a new, international conference for real-time and interactive graphics programming.

dull oyster Jun 9, 2024, 10:25 AM

#

November? Not sure I'll be very available 😬

wispy spear Jun 9, 2024, 10:26 AM

#

calling for papers ends at end of june (can probably be extended)

#

November 12-14, 2024

#

just saying

primal shadow Jun 9, 2024, 5:53 PM

#

dull oyster First off, congrats on writing about the entire pipeline! Overall I think the a...

Maybe I missed something but I could not see this example?
It was two paragraphs ago, the first one in that section
"I mentioned earlier that frustum culling is not sufficent for complex scenes. With meshlets, we're going to have a lot of geometry in view at once. Rendering all of that is way too expensive, and unnecessary. It's a complete waste to spend time rendering a bunch of detailed rocks and trees, only to draw a wall in front of it later on (overdraw)."

Glad you liked it!

primal shadow Jun 9, 2024, 5:54 PM

#

wispy spear <@163242187084005377> + <@320895822394818561> + <@226726721133477888> https://ww...

I'm American 😅 . But also not sure I've done much unique, it's msotly copied from Nanite.

dull oyster Jun 9, 2024, 5:56 PM

#

wispy spear November 12-14, 2024

Now that's the trailer's out I can answer: I will be too busy shipping Flight Sim 2024 to participate or event attend GPC 😢

wispy spear Jun 9, 2024, 5:59 PM

#

dull oyster Now that's the trailer's out I can answer: I will be too busy shipping Flight Si...

i can send a mail to your manager if you want 😛

primal shadow Jun 9, 2024, 9:18 PM

#

It's published now: https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14

Virtual Geometry in Bevy 0.14

velvet marsh Jun 9, 2024, 9:37 PM

#

keep learning new things on reread, frame budgets!

#

also using previous frame data to do occlusion culling is something I've been reading about elsewhere, it's super interesting, and I have no idea how to do it yet

frank sail Jun 9, 2024, 9:48 PM

#

didnotread but typically it would be that you produce visibility of objects for the next frame

#

so 1 bit per object that you write to this frame and read from the next frame

faint crane Jun 10, 2024, 3:43 AM

#

velvet marsh also using previous frame data to do occlusion culling is something I've been re...

Read about 2 pass occlusion culling.

primal shadow Jun 10, 2024, 5:12 AM

#

velvet marsh also using previous frame data to do occlusion culling is something I've been re...

This is the article I recommend https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501

Medium

Two-Pass Occlusion Culling

Occlusion Culling is an optimization technique used to improve performance by skipping the rendering of objects which are hidden by some…

#

The only thing is it suggests using int mipLevel = floor(log2(max(AABB.pixelWidth, AABB.pixelHeight)));

#

I had bugs with that. I instead used max(0, ceil(log2(max_width_or_height))

#

You also don't have to store an explicit 0/1 visible bit between frames. I think it's easier to use last frame's depth pyramid.

velvet marsh Jun 10, 2024, 5:15 AM

#

thank you

primal shadow Jun 10, 2024, 5:15 AM

#

np

wispy spear Jun 10, 2024, 8:18 AM

#

you could add +1 to the floor one

ebon ruin Jun 10, 2024, 1:25 PM

#

Who’s thread is this?

wispy spear Jun 10, 2024, 1:30 PM

#

lvstri's

ebon ruin Jun 10, 2024, 1:35 PM

#

it looks like he posts 4 times a month lol

faint crane Jun 10, 2024, 1:44 PM

#

This is the lvstri fan fan club.

delicate rain Jun 10, 2024, 1:45 PM

#

We all gather in the hopes of LVSTRI showing up

#

Still didn't get my gpu signed by him

wispy spear Jun 10, 2024, 2:02 PM

#

lustri is also neck deep in uni shit

delicate rain Jun 10, 2024, 2:02 PM

#

I am officially done today

#

passed the last exam two hours ago 😄

#

Time to get back to the VSM struggle froge_bleak

wispy spear Jun 10, 2024, 2:49 PM

#

woohoo

velvet marsh Jun 10, 2024, 3:42 PM

#

Congratulations!

delicate rain Jun 10, 2024, 3:44 PM

#

just one more semester

#

last one bro I promise

wispy spear Jun 10, 2024, 3:44 PM

#

hu?

#

you said you are done ange

delicate rain Jun 10, 2024, 3:50 PM

#

ah sorry, done with this semester

wispy spear Jun 10, 2024, 3:50 PM

#

: )

delicate rain Jun 10, 2024, 3:50 PM

#

3 months free-ish time looking for temp job

#

and then one more

#

kekkedsadge

wispy spear Jun 10, 2024, 3:50 PM

#

so they say

wicked notch Jun 10, 2024, 4:09 PM

#

wispy spear lustri is also neck deep in uni shit

It's almost over

#

soon™️

wispy spear Jun 10, 2024, 4:10 PM

#

6921frogtea

ebon ruin Jun 10, 2024, 4:58 PM

#

well

#

lvstri follows my thread

#

so ill follow his

#

he seems cool

wispy spear Jun 10, 2024, 4:58 PM

#

he is

wicked notch Jun 10, 2024, 5:55 PM

#

welcome to nanite HQ

delicate rain Jun 10, 2024, 6:15 PM

#

Piece by piece we reassemble unreal engine 5

#

And once we manage it, ue6 will release, featuring a fully ai driven rendering pipeline

#

And we switch to being embedded developers

primal shadow Jun 11, 2024, 7:15 PM

#

https://youtu.be/yD5DjNyiYIQ?t=292

#

So it was pointed out to me that Tencent made a Nanite-like system on mobile

#

They mention they used 32 bit atomic buffers for software rendering

#

But this has got to be one of the least informative GDC talks I've ever seen. They explain nothing

#

So frustrating. I really want to know how they made 32 bit atomics work

faint crane Jun 11, 2024, 7:59 PM

#

Cool blog. https://www.jendrikillner.com/post/graphics-programming-weekly-issue-343/

storm wolf Jun 11, 2024, 8:36 PM

#

i was under the impression that nanite used atomics as well

#

is there another way for software rast?

wicked notch Jun 11, 2024, 8:40 PM

#

perhaps, it depends on how you want to tune your virtual geometry

#

if you don't need pixel sized triangles then you could entirely skip the soft rast

#

otherwise atomics are kinda necessary because you need to min on depth

storm wolf Jun 11, 2024, 8:41 PM

#

is there some trick other than min/max and output - i cant imagine it is too complex

wicked notch Jun 11, 2024, 9:17 PM

#

hmm the objective in the end is to write depth in some way

#

how would one emulate the depth test without min?

frank sail Jun 11, 2024, 9:21 PM

#

per-pixel linked list of fragment outputs, then sorting

primal shadow Jun 11, 2024, 9:40 PM

#

storm wolf is there some trick other than min/max and output - i cant imagine it is too com...

Someone came up with one in the bevy discord.

Atomic read cluster+triangle ID (aka visbuffer ID) from 32bit buffer
Atomic min the depth in a second 32bit buffer
AtomicCompareAndExchange writing the new visbuffer ID, and comparing using the visbuffer ID from step 1
Retry if exchange fails

#

It was something like that, I forget the exact details

glass sphinx Jun 11, 2024, 10:06 PM

#

@primal shadow do you also stream and store partial lods in this impl?

wicked notch Jun 11, 2024, 10:18 PM

#

primal shadow Someone came up with one in the bevy discord. 1. Atomic read cluster+triangle I...

so basically a cas loop bleakekw

primal shadow Jun 11, 2024, 10:20 PM

#

wicked notch so basically a cas loop <:bleakekw:1082598350303539240>

Yep

primal shadow Jun 11, 2024, 10:22 PM

#

glass sphinx <@145540119141679105> do you also stream and store partial lods in this impl?

Nope. Streaming is long term. Mostly because there's other more important stuff, and there's a lot I haven't figured out yet:

Do I store vertex data per meshlet? Rn I store it per mesh and each meshlet references it
Do I need to wait for transfer queues? Is using one main queue and blocking rendering work while mesh data gets transferred ok?

#Iris - A Journey through OpenGL and beyond to learn Graphics