#Iris - A Journey through OpenGL and beyond to learn Graphics
We achieve this by performing quantization in object space using a user-selectable
power of step size centered around the object origin.
It is crucial that the step size is not normalized to the bounds of the object or in other
ways tied to its dimensions.
From the nanite siggraph presentation
So yeah, feels a bit contradictory
I assume they just choose a step size based on how big the object is, on the assumption that meshes used together will have similar sizes. The same size-based heuristic should then give the same quantization factor, and it'll work out fine when both meshes are placed with transforms that are multiples of the step size
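A minimal sketch of what the quoted slide seems to describe, assuming the step is `2^step_exponent` in object-space units and the grid is centered on the object origin. The function names and the `step_exponent` parameter are made up for illustration, not taken from Nanite.

```rust
/// Quantize an object-space position onto a grid whose step is a power of two.
/// `step_exponent` is the user-selected exponent: step = 2^step_exponent
/// (e.g. -4 means a step of 1/16 unit). The step is NOT derived from the mesh bounds.
fn quantize_position(p: [f32; 3], step_exponent: i32) -> [i32; 3] {
    let inv_step = ((-step_exponent) as f32).exp2(); // 1 / step
    [
        (p[0] * inv_step).round() as i32,
        (p[1] * inv_step).round() as i32,
        (p[2] * inv_step).round() as i32,
    ]
}

/// Map grid coordinates back to object space.
fn dequantize_position(q: [i32; 3], step_exponent: i32) -> [f32; 3] {
    let step = (step_exponent as f32).exp2();
    [q[0] as f32 * step, q[1] as f32 * step, q[2] as f32 * step]
}
```

Because the grid does not depend on the mesh, two meshes quantized with the same exponent land on the same grid, and moving one by a whole number of steps just offsets its integer coordinates.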
memory allocation of 4548506711262943144 bytes failed uh oh
memory allocation of 4548506711262943144 bytes failed
[Inferior 1 (process 209) exited with code 011]
(gdb) bt
No stack.
(gdb)
wtf
how many exabytes is that
idk what went wrong :((
oh whoops, wasn't my code's fault
forgot to copy paste some asset writing code into my testing setup
I mean what are you surprised about
it exited
normally
if you were to crash it, e.g. with abort(), then it'd work
Nooo, I finished my shaders for decoding, and all it's rendering is points D:
hmm, so every vertex has the same position, huh...
oh, LOL
totally forgot to use the vertex_id parameter of my get_meshlet_vertex_position() function 😅
that would explain it
oh god
perfectly shippable
it's a buggy
bugs bunny!
Bleh I can't figure out how to fix it, as renderdoc seems to be showing my fake values in the debugger :/
artifishial paint simulator, i like it. that blue bunny on the right ackchually looks quite cool
i will steal the image and use it as the new server banner : )
Ah I have discovered my issue, fml
the bitstream reader I did on the GPU does not account for when the bits of a vertex are split across two different buffer elements :/
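A minimal CPU-side sketch of a reader that handles the straddling case, assuming an LSB-first bitstream stored in 32-bit words. `read_bits` is a hypothetical name. WGSL typically has no 64-bit integers, so on the GPU the same thing would be done with two shifts and an OR; the 64-bit window here just keeps the sketch short.

```rust
/// Read `bit_count` (1..=32) bits starting at absolute bit offset `bit_offset`
/// from a buffer of 32-bit words, least significant bit first.
fn read_bits(words: &[u32], bit_offset: usize, bit_count: u32) -> u32 {
    assert!((1..=32).contains(&bit_count));
    let word_index = bit_offset / 32;
    let bit_in_word = (bit_offset % 32) as u32;
    // Combine the current word with the next one so a value that is split
    // across two buffer elements is still read in one go.
    let lo = words[word_index] as u64;
    let hi = words.get(word_index + 1).copied().unwrap_or(0) as u64;
    let window = lo | (hi << 32);
    let mask = (1u64 << bit_count) - 1;
    ((window >> bit_in_word) & mask) as u32
}
```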
People in #webgpu helped me out, think I got it working
Shading looked a bit off though, I think I'm compressing normals too much
I'm not compressing UVs, but so far per-meshlet buffer is not a size savings, at least for the bunny mesh
I guess the triangle compression I can do should improve it more
And streaming will give me memory savings
compressing UVs should be safe from f32vec2 -> u32 (packed 2 halfs)
I've heard very bad things about 16bit UVs
I mean it depends on the asset 
Ok yeah, using octahedral encode, and snorm2x16 is too inaccurate
Wait am I supposed to be using snorm, or unorm 🤔
What's the output of octahedral encode?
[0,1] or [-1,1]?
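For reference, a sketch of the octahedral mapping as it is commonly written, which produces values in [-1, 1] and therefore pairs with snorm (a unorm path would need a `* 0.5 + 0.5` remap first). The packing helpers are hand-rolled stand-ins for the shader built-ins, and all names are assumptions.

```rust
fn sign_not_zero(v: f32) -> f32 {
    if v >= 0.0 { 1.0 } else { -1.0 }
}

/// Map a unit normal to the octahedron and unfold it into the [-1, 1]^2 square.
fn octahedral_encode(n: [f32; 3]) -> [f32; 2] {
    let inv_l1 = 1.0 / (n[0].abs() + n[1].abs() + n[2].abs());
    let (x, y) = (n[0] * inv_l1, n[1] * inv_l1);
    if n[2] >= 0.0 {
        [x, y]
    } else {
        // Fold the lower hemisphere over the diagonals.
        [(1.0 - y.abs()) * sign_not_zero(x), (1.0 - x.abs()) * sign_not_zero(y)]
    }
}

/// Pack two values in [-1, 1] into one u32 as signed normalized 16-bit integers.
fn pack_snorm2x16(v: [f32; 2]) -> u32 {
    let q = |f: f32| ((f.clamp(-1.0, 1.0) * 32767.0).round() as i16) as u16 as u32;
    q(v[0]) | (q(v[1]) << 16)
}

/// Inverse of `pack_snorm2x16`.
fn unpack_snorm2x16(p: u32) -> [f32; 2] {
    let d = |bits: u32| ((bits as u16 as i16) as f32 / 32767.0).max(-1.0);
    [d(p & 0xFFFF), d(p >> 16)]
}
```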
@wicked notch Mega Lights 2.0 when?
i have no idea, only saw the summary at gamesfromscratch
It's up on their github. It's RT based, but no ReSTIR surprisingly.
Interesting
It's still in the research phase of finding the best solution (best tradeoffs for our use cases) and things may change, but at the moment there's no ReSTIR in MegaLights.
Stochastic light sampling doesn't really say much
But yeah makes sense, thank you!
Pray I do not quantize you further
but how do they RT properly with nanite meshes 
fallback repr
that's not 'properly'
that's cringe
incredibly cringe
so cringe that i'm probably gonna do fallback repr myself
That's what I want to know
it is what unreal does
you might not like it
but unreal doesn't care that you don't like it
but what if I ping the entire epic games developer github org
@primal shadow just fixed another silly mistake in the edge detection, maybe try it now 🙃
Hello Nanite folks
my meshlet generator sometimes creates split meshlets where a meshlet would be in two pieces (that can be quite far from each other). Will this be problematic when implementing the LOD tree?
yes
i see, thanks
Will try later tn
tomorrow*, ended up getting paged for work
Question for people: how are you generating the LOD bounds for the group?
not for individual meshlets
for base groups, merge the bounds of the meshlets (or just calculate it directly)
then merge the group bounds of all meshlets for higher LODs
Oops forgot to update it here, I figured it out
Yep that's what I figured out was correct
Except apparently getting a minimal bounding sphere of bounding spheres is a 183 page thesis xD
There's an open source implementation in the CGAL library, but it's GPLv3 :/
So guess I'll go with an approximate method
yep i just merge two at a time 
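A sketch of the approximate "merge two at a time" approach, assuming spheres are stored as center plus radius. The result encloses both inputs but is not the minimal enclosing sphere of a whole set, and it depends on merge order. `Sphere`, `merge_spheres` and `merge_all` are illustrative names.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

/// Smallest sphere enclosing exactly two spheres.
fn merge_spheres(a: Sphere, b: Sphere) -> Sphere {
    let d = [
        b.center[0] - a.center[0],
        b.center[1] - a.center[1],
        b.center[2] - a.center[2],
    ];
    let dist = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt();
    // One sphere already contains the other (this also covers dist == 0).
    if dist + b.radius <= a.radius {
        return a;
    }
    if dist + a.radius <= b.radius {
        return b;
    }
    let radius = (dist + a.radius + b.radius) * 0.5;
    // Slide the center from a towards b until a's far side touches the new surface.
    let t = (radius - a.radius) / dist;
    Sphere {
        center: [
            a.center[0] + d[0] * t,
            a.center[1] + d[1] * t,
            a.center[2] + d[2] * t,
        ],
        radius,
    }
}

/// Fold a list of spheres pairwise; returns None for an empty list.
fn merge_all(spheres: &[Sphere]) -> Option<Sphere> {
    let mut iter = spheres.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, merge_spheres))
}
```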
Did some asset size / quality / perf comparisons between compressed per-meshlet vertex data, and a single set of uncompressed vertex data shared between all meshlets. Sadly asset size is nearly identical. On the upside, quality and perf are also identical, I can implement streaming now, and there's still room to further compress vertex data so more wins in the future hopefully. https://github.com/bevyengine/bevy/pull/15643#issuecomment-2395198350
UVs are completely uncompressed (64 bits), normals can probably be quantized and variable-length encoded similar to positions rather than the current 32 bits per vertex, and triangle data can be compressed with fancy triangle strip encodings (currently 24 bits per triangle, 8 bits per index * 3)
Do you have a link handy? I can procrastinate further on occlusion culling and give it a shot.
oh yeah that reminds me that my culling is a wee bit broken along the bottom and right edges of the screen
I'm realizing I have no idea how nanite's error projection is supposed to work
Each cluster needs: bounding sphere of its group, and of its parent group
and then, parent group bounding spheres must strictly encompass all of their child bounding spheres
but then how do you mix error into this setup? Where does the error come in during the building and runtime steps?
I suppose the test is group_can_be_rendered = projected_sphere_radius(group.center, group.radius) < group.error_radius
I.e. the group has error_radius deformity from its children, so if the size of the group on screen is less than that, it's basically equivalent to its children, so it's ok to render
no, that doesn't seem quite right either
yeah me neither
it's a pain
what i was originally doing was placing a sphere on the closest point of the lod bounds and checking its projected screenspace radius
(and clamping to the camera)
but that leads to holes for some reason
now i do this
but i forgot what it's actually doing 
and it sometimes leads to double-rendering i think
(the comment is wrong)
(please tell me if you come up with something better)
Based on https://vcg.isti.cnr.it/~ponchio/download/ponchio_phd.pdf 3.6.1, it sounds like we should:
For each cluster, store culling bounding sphere, lod bounding sphere, and error
Leaf meshlets (i.e. initial set of starting meshlets): Generate the culling bounding sphere, lod bounding sphere is a copy of the culling bounding sphere, error = 0
And then you group meshlets, simplify the group, and split into new meshlets
And for each new meshlet: compute culling bounding sphere, lod bounding sphere = new sphere encompassing lod bounding sphere of all children in the group, and error = max(simplification_error, child1_error, child2_error, ...)
Ok so that's building, now you have for each cluster: culling sphere, lod sphere, and error
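The build-side bookkeeping above could be sketched like this, assuming a pairwise sphere merge as the (approximate) enclosing-sphere method. Type and function names are made up for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

#[derive(Clone, Copy, Debug)]
struct ClusterLodData {
    culling_sphere: Sphere,
    lod_sphere: Sphere,
    error: f32,
}

/// Approximate enclosing sphere of two spheres.
fn merge_spheres(a: Sphere, b: Sphere) -> Sphere {
    let d: Vec<f32> = (0..3).map(|i| b.center[i] - a.center[i]).collect();
    let dist = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt();
    if dist + b.radius <= a.radius {
        return a;
    }
    if dist + a.radius <= b.radius {
        return b;
    }
    let radius = (dist + a.radius + b.radius) * 0.5;
    let t = (radius - a.radius) / dist;
    let center = [
        a.center[0] + d[0] * t,
        a.center[1] + d[1] * t,
        a.center[2] + d[2] * t,
    ];
    Sphere { center, radius }
}

/// Leaf meshlet: LOD sphere is a copy of the culling sphere, error is zero.
fn leaf_cluster(culling_sphere: Sphere) -> ClusterLodData {
    ClusterLodData { culling_sphere, lod_sphere: culling_sphere, error: 0.0 }
}

/// Meshlet produced by simplifying a group: the LOD sphere encloses the
/// children's LOD spheres and the error never drops below any child's error.
fn simplified_cluster(
    culling_sphere: Sphere,
    simplification_error: f32,
    children: &[ClusterLodData],
) -> ClusterLodData {
    let lod_sphere = children
        .iter()
        .map(|c| c.lod_sphere)
        .reduce(merge_spheres)
        .unwrap_or(culling_sphere);
    let error = children.iter().map(|c| c.error).fold(simplification_error, f32::max);
    ClusterLodData { culling_sphere, lod_sphere, error }
}
```

Taking the max with the child errors is what keeps the error monotonic going up the hierarchy.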
now at runtime you gotta do this
tight bounding sphere = meshlet cluster bounding sphere
...and this runtime part I'm still reading the paper to figure out
but anyways you do this, and also for the parent sphere somehow?
and then you draw if self == good && parent == bad
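The "self == good && parent == bad" test, sketched on already-projected errors. This assumes errors are monotonic up the hierarchy and that the root's missing parent counts as infinitely bad; how the projection itself is done is a separate question. Names are illustrative.

```rust
/// A cluster is drawn when its own projected error is acceptable
/// but its parent's projected error is not.
fn should_draw(self_error_px: f32, parent_error_px: f32, threshold_px: f32) -> bool {
    self_error_px <= threshold_px && parent_error_px > threshold_px
}

/// `errors_px[i]` is the projected error of level i along one chain
/// (0 = finest); the coarsest level has no parent, treated as infinite error.
fn selected_levels(errors_px: &[f32], threshold_px: f32) -> Vec<usize> {
    (0..errors_px.len())
        .filter(|&i| {
            let parent = errors_px.get(i + 1).copied().unwrap_or(f32::INFINITY);
            should_draw(errors_px[i], parent, threshold_px)
        })
        .collect()
}
```

With monotonic errors exactly one level per chain passes the test, which is why non-monotonic projections show up as holes or double rendering.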
finding the minimum enclosing ball of points (for leafs) and minimum enclosing ball of balls (for all other nodes) [35]
Oh hey, they reference fischer's thesis, ok cool so I was on the right track with that
it seems like you're supposed to take the cluster's bounding sphere, find the closest point on the surface to the viewport, and then project a new sphere where the center is that point, and the radius is your error
so that tells you whether or not the current cluster has visible error, but what do you do about the parent???
And when you're building a BVH like nanite does, you can't involve the cluster bounding sphere at all, it has to be based solely on group/LOD data
So nanite must be doing something different here
really I think my original idea I've been using for the past year is on track
the cluster bounding sphere doesn't matter
what matters is for the cluster's group, and the cluster's parent group (group with cluster in it before simplifying),
given the LOD bounding sphere (located somewhere), with radius = group error, is the projected size of that group small enough such that the error is invisible?
The problem is, where do you choose to locate that sphere?
that's really the key question
because if you're saying radius = error, and you force error to be monotonic, then the bounding sphere projections will always be monotonic if they're located in the same spot
the issue is if you start moving where the bounding spheres are, then you run into issues
so where the heck do you choose to locate it??
I think the way Nanite does this is not necessarily straightforward sphere projection
You have the group bounding sphere (encompassing all child group bounding spheres), and the group error of the cluster
And then you somehow project that error to the screen using the group sphere bounds
but it's not projecting the sphere itself? Something like that
Maybe it is this
tight bounding sphere = group bounds
and then you find the closest point on the group bounds to the viewport, make a new sphere centered there with radius = error, and then calculate projected size of that sphere?
that is using the saturated sphere (lod bounds) and placing the error sphere on the closest point on that, and then projecting it to the screen
unfortunately it doesn't work if you're inside the lod bound sphere
or it doesn't work with a bvh, idk
didn't work for me when i tried it
I think you can probably just force LOD 0 at that point
they get quite large for things high up in the bvh
so there's holes in the mesh when the camera is inside one lod bound but not the other
or something like that
instead i calculate the distance that the error would be less than a pixel and check if the closest point on the lod sphere (or something like that) is closer or farther than that
What meshoptimizer's nanite demo is doing is returning infinity when inside the LOD sphere. That way infinity is never <= threshold, and therefore that LOD never gets selected.
Forcing you to pick a finer LOD
Think I'm going to try that with projecting an error-radius sphere on the closest point on the LOD sphere
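A sketch of that idea as described in the chat (not meshoptimizer's actual code): treat the error as the size of something sitting at the closest point of the LOD sphere, project it with a simple rotation-invariant perspective scale, and return infinity when the camera is inside the sphere. `LodBounds`, `proj_scale` and `projected_error_px` are assumed names.

```rust
struct LodBounds {
    center: [f32; 3],
    radius: f32,
    error: f32,
}

/// Pixels covered by one world unit at distance 1, for a vertical FOV in radians.
fn proj_scale(viewport_height_px: f32, fov_y_radians: f32) -> f32 {
    viewport_height_px / (2.0 * (fov_y_radians * 0.5).tan())
}

/// Approximate on-screen size (in pixels) of the LOD error, measured at the
/// closest point of the LOD sphere. Infinite when the camera is inside the
/// sphere, so that `result <= threshold` can never select this LOD.
fn projected_error_px(bounds: &LodBounds, camera_pos: [f32; 3], proj_scale: f32) -> f32 {
    let dx = bounds.center[0] - camera_pos[0];
    let dy = bounds.center[1] - camera_pos[1];
    let dz = bounds.center[2] - camera_pos[2];
    let dist_to_surface = (dx * dx + dy * dy + dz * dz).sqrt() - bounds.radius;
    if dist_to_surface <= 0.0 {
        return f32::INFINITY;
    }
    bounds.error * proj_scale / dist_to_surface
}
```

This ignores camera orientation and near-plane clipping, which is a simplification.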
yeah I think I'm doing something similar to what zeux says nanite is doing here
but it's buggy so I might need to revisit that lol
https://github.com/bevyengine/bevy/pull/15643#issuecomment-2398801204 big win for memory usage!
LZ4 was somehow doing a shit ton of work before, considering the before/after asset size with LZ4 compression applied is basically the same
But memory usage is nearly halved after
Still on my backlog dw. Currently doing some changes to error projection and bounding spheres which should both improve LODs but also allow my converter code to work on larger meshes that it was crashing on before. After that, I'm going to go back to tweaking the builder code, and add the manual edge locking, larger meshlet groups, and probably attribute-aware simplification.
Well this clearly didn't work
Much better, but I think it's vastly over-estimating error 😅
you COULD start a side business and sell those as contemporary art installations
I think this is what I was describing that isn't always monotonic?
but which paper is that
fn lod_error_is_imperceptible(sphere: MeshletBoundingSphere, error: f32, world_from_local: mat4x4<f32>, world_scale: f32) -> bool {
    let cp_world = world_from_local * vec4(sphere.center, 1.0);
    let r_view = world_scale * sphere.radius;
    let cp_view = (view.view_from_world * vec4(cp_world.xyz, 1.0)).xyz;
    // TODO: Handle view clipping / being inside sphere bounds
    let aabb = project_view_space_sphere_to_screen_space_aabb(cp_view, r_view);
    let screen_size = max(aabb.z - aabb.x, aabb.w - aabb.y);
    let meters_per_pixel = sphere.radius / screen_size;
    return error < meters_per_pixel;
}
Not documented and poorly named atm, but this
Take LOD sphere, project to screen space to get the pixel size
Then you do sphere.radius(?) / pixel_size
I.e. if your sphere has radius 10, and your pixel_size is 4
you get 10/4 = 2.5
E.g. 2.5 meters = 1 pixel on screen
And error is already an object-space distance in meters
So now if e.g. error = 3.2
Well 2.5 meters = 1 pixel
So 3.2 meters on screen is greater than 1 pixel
I.e. visible error
So it's sufficient to check that error < meters_per_pixel
I.e. error needs to be less than 2.5 meters so that it's less than 1 pixel on screen
Since the relation between meters and pixel on screen is linear
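The arithmetic from the example above, as a tiny sketch. It only mirrors the numbers in the chat (radius 10, projected size 4 pixels) and says nothing about whether the projection feeding it is right. Function names are hypothetical.

```rust
/// How many object-space meters one screen pixel covers, estimated from a
/// sphere of known radius and its projected size in pixels.
fn meters_per_pixel(sphere_radius_m: f32, projected_size_px: f32) -> f32 {
    sphere_radius_m / projected_size_px
}

/// The error is invisible when it covers less than one pixel.
fn error_is_imperceptible(error_m: f32, sphere_radius_m: f32, projected_size_px: f32) -> bool {
    error_m < meters_per_pixel(sphere_radius_m, projected_size_px)
}
```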
I do need to handle clipping when inside the sphere bounds though. The paper covers it.
I'm not quite convinced on some of this though. And it feels really weird to project the LOD sphere and then compare the size to the simplification error, rather than projecting the simplification error directly.
the screenshot you sent above mentions something about comparing error directly with distance * some threshold 
this code looks different
Yeah I didn't follow it
Because I have no idea how to handle the case where you're inside the sphere for that
The one from the batched multi triangulation paper
I also could not find code or the algorithm description for it at all, I'm giving up on that approach
could you send a link to it?
i can take a look and try to figure out what it is
https://d-nb.info/997062789/34#page=48 sections 3.6.1 (specifically figure 3.15), and 4.2.3 on page 60
thanks!
it's not for perspective projections

Can you help me understand what zeux is saying here then? https://github.com/zeux/meshoptimizer/discussions/783
UE5 Nanite computation is similar in principle, but mechanically different - it takes into account cases where the sphere is clipped by znear, and it takes camera orientation into account, so the computation is not camera rotation invariant. Conceptually I think it's the same as your reference to fig 3.15, even though I find that specific figure odd as no lines or points there connect to sphere radius 🙂 they compute the projected sphere radius in pixels, invert it (that way they get the length that would project to one pixel, using the same coordinates that the sphere is in), and compare that to the error (which is also in linear units in the same coordinates that the sphere is in).
hmm
from what i understand, it's the same thing you're doing
but the inversion is probably more complex than just a div
it could also be calculating the distance at which the error becomes less than a pixel, and just comparing the closest point on the sphere with that
but the distance for 1px error depends on where the sphere is 
so there could be some normalization step to convert from an off-center sphere to a sphere in the center
I have no clue
Am I insane or does zeux make it sound like it's linear?
Ok lol so I forgot to multiply 0..1 by the view size 😅 , looks better now
Zeux also left me some more info, so I'm going to take a look at that too
Ok I'm just stealing zeux's code
I give up trying to understand this
Not sure I handled ortho correctly but yeah
Is orthographic even useful with a Nanite implementation?
I guess reviewers will complain anyway, even if w=1.
Shadow maps 🙂
@fiery bolt what are you using for your error projection? Seems like you're using a method I don't understand based on the projected bounding sphere
i have no idea tbh
i did geometry on paper for it
idk where it went

it's not correct tho
Hmm ok. Back to builder improvements.
you should multithread simplification
No time for that, bevy release is very soon
Plan is to finish stealing zeux's error projection, steal some of the simplification improvements he had, write a hopefully faster and easy fill cluster buffers improvement, and then maybe fix SW raster if I have time
And then write the blog post for everything I did this cycle and help out with the rest of the release
Oh btw do you have a gltf -> virtual geometry mesh converter?
I'm curious how you're handling materials
as of now, by not
i don't have a renderer
it's just visbuf output and debug
right i forget you have users 
"forgot you had users" is the most GP thing ever
no no we do it in the rust server too
specifically #games-and-graphics and # lang-dev
no that's not what i wanted discord
discord moment
I claim ownership of these channels
they now belong to GP inc.
surrender or else 🔫 🐸
for that you must shill rust 
the first rule I'll implement is ban rust
along with a healthy dose of offtopic cat gifs
that is allowed
So focused on the CPU language, you're missing the real issue of the GPU language smh
you're outnumbered here actually 
I didn't know there was a difference 
mfw this is my own channel and I'm outnumbered
I have been playing with vcc tho
without telling gob
I managed to make a transpiler
from clang to glsl
i should try to re-derive or figure out the error proj math in my code tomorrow 
because I didn't feel like reading the spirv spec
and figure out why it doesn't work sometimes and only sometimes
the spirv spec is surprisingly good ngl
i'm currently on my like
6th
probably 6th renderer too
but i think i passed my final interview for an internship at e🅱️ic despite my serious lack of braincells and knowledge 
just maybe
nono it's for lang dev not giraffics
same thing, your soul will be sucked dry in exchange for monetary compensation
it's been 2 whole days since my interview
they've never taken more than a day to respond 
terminally online, just like us frfr
i mean yeah it's biscuits and tea country
so it can't be america levels of bad 
in return i will get 30% the pay tho
(there are several banks offering the same pay for software dev interns as tesco shelf stackers)
oh and tesco themselves actually
Work at NVIDIA at that point and take the comp.
Ok error projection finished. TODO:
- Builder improvements from zeux
- SW raster subpixel precision + top left rule
- New fill cluster buffers
Alright, manual vertex lock time
I want to do some lodding soon. I'll have to catch up here
how do you build the lod hierarchy levels?
I can only point you here:
https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/
https://jglrxavpok.github.io/
- split into meshlets
- group meshlets into groups of 8-16
- lock boundary vertices on those groups
- simplify groups
- split groups into meshlets
- pool all meshlets together and repeat
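To see how the loop above converges, here is a toy simulation that only tracks triangle counts. It assumes 128 triangles per meshlet, groups of 8 taken as consecutive chunks (real code groups by adjacency), and a simplifier that always hits exactly 50% (real ones often get stuck because of the locked boundary vertices). Everything here is illustrative.

```rust
const MAX_TRIS_PER_MESHLET: usize = 128; // assumed meshlet size
const GROUP_SIZE: usize = 8; // chat suggests groups of 8-16

/// Split a triangle count into meshlets, returning the triangle count of each.
fn split_into_meshlets(triangles: usize) -> Vec<usize> {
    let mut meshlets = Vec::new();
    let mut remaining = triangles;
    while remaining > 0 {
        let take = remaining.min(MAX_TRIS_PER_MESHLET);
        meshlets.push(take);
        remaining -= take;
    }
    meshlets
}

/// Returns the number of meshlets at each LOD level, finest first.
fn simulate_lod_levels(base_triangles: usize) -> Vec<usize> {
    // Split into meshlets.
    let mut meshlets = split_into_meshlets(base_triangles);
    let mut levels = vec![meshlets.len()];
    while meshlets.len() > 1 {
        let mut next = Vec::new();
        // Group meshlets.
        for group in meshlets.chunks(GROUP_SIZE) {
            // Lock the group boundary and simplify (idealized as exactly half).
            let group_triangles: usize = group.iter().sum();
            let simplified = (group_triangles / 2).max(1);
            // Split the simplified group back into meshlets.
            next.extend(split_into_meshlets(simplified));
        }
        // Stop if simplification no longer reduces the meshlet count.
        if next.len() >= meshlets.len() {
            break;
        }
        // Pool all new meshlets together and repeat.
        meshlets = next;
        levels.push(meshlets.len());
    }
    levels
}
```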
where nanite slides smh
- My current code (still making some improvements literally today): https://github.com/JMS55/bevy/blob/42617d4abc6ec6ac4fba5c24db84d7e1f60666b5/crates/bevy_pbr/src/meshlet/from_mesh.rs#L63
- Meshopt's demo: https://github.com/zeux/meshoptimizer/blob/master/demo/nanite.cpp
- Nanite slides: https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf
- Traverse research: https://blog.traverseresearch.nl/creating-a-directed-acyclic-graph-from-a-mesh-1329e57286e5
do i have to meet some criteria for those 8-16 groups?
meshlets in them should be touching or as close together as possible
I read the nanite slides but i still don't understand the hierarchy building
If you use my blog post, I wouldn't copy the code exactly, make sure to reference mine/meshopt's code for the up-to-date changes. I have a new blog post coming in ~2-3 weeks that will be up to date.
how does the simplify groups step work?
is it N tris to N/2 tris?
https://github.com/SparkyPotato/radiance/blob/main/crates/asset/src/mesh/import.rs
this is my current code
yup
and you keep track of error introduced
and set that as the parent error for the unsimplified group, and self error for the simplified meshlets
which reminds me i should write a blog post on error once i figure it out correctly 
Take ~8 meshlets, merge them into one big triangle strip, use meshopt to simplify into a new triangle strip, break apart back into new meshlets (if you simplified by ~50%, then hopefully you end up with ~4 meshlets)
and how i did the bvh
how do you guys do the hierarchical culling? With persistent threads?
nah, i use an indirect dispatch chain rn
it's fast enough
nanite uses a dispatch chain on pc too
but didnt they say they dont?
that's on console
I just do a brute-force dispatch over every cluster in the scene across all LODs atm. I'm planning to switch to hierarchical culling in a bit, once I'm done with my current round of improvements to other areas.
because it's technically UB
also because you don't wanna debug that shit
if even UE uses indirect dispatches, i will too
i might try persistent threads after i have streaming working
workgraphs can't come soon enough
or just use workgraphs yeah lol
sad that they have such horrible perf on nvidia
meshoptimizer can build lower poly versions of a mesh?
i wonder if i can abuse DGC to only conditionally insert an indirect dispatch
Yeah.
Try their demo, they have a bunch of configurable stuff
Or download bevy and play around with it
if i use coherent writes i don't think i'll need barriers?
oh shit what, meshopt has a nanite demo
yeah it's new
I linked it above 😛
yee just saw that
or download my shitty code and play around it 
i have an editor too
True, Bevy doesn't yet
i need to add a way to actually spawn meshes though lmao
you can only select and move things rn 
oh and undo
hm the demo doesnt seem to have a cmake target
https://meshoptimizer.org/demo/ oh shit im dumb
i cant interact with it
i think i might wanna try the hierarchical lodding at some point
but i definitely don't want the partial streaming part. Streaming full lods is way less headache i think.
i saw some analysis and nanite mesh streaming is a major cause for stutter. Needs much more pcie bandwidth than tex streaming for some reason
is yours c++?
nope
The demo isn't interactable. You can have it dump data to an obj and open in blender though.
was it a threat interactive analysis 
ue has lots of problems like that causing stutter
kinda a shame. Their shader comp stutter is only getting worse with time too
traversal stutter my beloved
you can make UE compile shaders on start though, can't you
not really, no
they have made it fundamentally kinda impossible with their materials
have to pat the back of cod team on this one. I think cod engine is probably one of the least stuttery engines. Everything is prebuilt and static, even the rendergraph.
e🅱️ic should hire us to fix their stutter
they dont care about such things
There are also these repos that are worth looking into (C++)
https://github.com/jglrxavpok/Carrot/tree/rendering-improvements
https://github.com/daniilvinn/omniforce-engine/tree/meshlet-lods
maybe threat interactive was right all along
should i do streaming or restir or should i fix my shitty code first
no
parallelize
what are you, unreal engine, that does everything on a single thread?
grow 4 more hands and buy two more keyboards and mice
Lighting is so much harder, don't start that if you haven't done it before until you finish virtual geometry 😅
I also have a partially written blog post on restir I never finished, I should do that...
Cross platform DGC is a thing now
they dont solve this problem afaik
Idk what problem you need to solve
I thought you just needed a variable number of indirect dispatches
for hierarchical culling youd ideally just start new threads in the same dispatch
ah actually now that i think of it
Is it possible to start new dispatches immediately?
i think it has to flush and then do an execute indirect on the shader recorded command buffer
so it will have to run in passes still
with DGC you would figure out the deepest level of the hierarchy and create the DGC commands
how would you figure out the deepest hierarchy level without doing all the work
you only know what meshlets you need after each cull phase
no you only need to test the error
which admittedly is a lot of work
actually yeah dgc doesn't solve the issue 
it would also be very poorly parallel
ye
honestly dgc doesnt really do much at all imo
i dont really see a use for it in my things
DGC is really just a budget vkCmdDispatchIndirectCount
I mean I've done traditional RT before
Reading the spec for dgc hurt me ngl
why it's not that bad
It reminded me of work graphs how you have to create a bunch of shit first
ye you have indirect token layouts and command tokens
Or what feels like it from reading
which is a bit sad, but the layout only says which commands you're gonna use
and how many of each of them
it's still this
only more convoluted 
it does have the added benefit of being able to also issue draw commands
and pipeline state changes
Budget vkCmdDrawIndirectCount 
maybe this will make UE not issue a drawcall per material
Press x to doubt
vkCmdSetIndirectScissorsCount when
Real time though? It is very difficult to not have noise 😅
As opposed to?
issuing a drawcall per material?
uhhh I'll figure something out maybe it can't be too hard right

Hahaha yeah....
how about you catche deez nuts instead
Then you have slow lighting response. Also that's even harder than per pixel fun fact.
ye you do the AMD memes
Getting your cache to bend around corners and crap sucks
with screen space cache and world space cache
While not leaking light
just do whatever ReSTIR does
Speaking of caching gi-related things... 
shhh
RTX DI uses a neural net trained as you do inference in real time as their radiance cache, so hf with that
I have this as my alibi

idk what these words mean so they're not a good enough alibi
I like your funny words, pasta man
I cooked pasta today
did crimes 
the spaghet wasn't fitting in my pot so I broke it before putting it in
/s
real
Manual vertex locks are a great improvement!
Before:
https://cdn.discordapp.com/attachments/148468683998625792/1294885657310920744/image.png?ex=670ca3be&is=670b523e&hm=c17d3497e0b94cc1eff5ab4344575a13d0383ada496409d2b13e063746aab87a&
After:
https://cdn.discordapp.com/attachments/148468683998625792/1294886356039897128/image.png?ex=670ca465&is=670b52e5&hm=b4e5d06c0fa046596c9933c17a39d89ebea230229b1f62adb38a685a232dc339&
Too much triangle cruft at the intersections still. Hopefully retrying stuck clusters in later passes works.
1 bunny = 1 meshlet though!
so i seemed to have fixed my occlusion culling
well, almost
there just seems to be a band of under-culling at the edges for some reason
I fixed an occlusion culling bug today where my bounding spheres were wrong since I referenced i instead of j somewhere in a nested loop or something to that effect.
how does that work ? based on shady or what
love to hear it
hopefully recent commits didn't fuck your shit up too much
it's been refactor time for a while

welp
but also having functions with super common name not namespaced whatsoever is a recipe for disaster
i'm prefixing everything with shd_ or _shd_
that's v nice
I have a shader compiler called Slim Shady, but not the real Slim Shady, or death to Slim Shady.
shid
the one who says, is
is that the european version of "whoever smelt it dealt it"?
I'm gonna
Finally figured out why my renderer broke with 1042 instances
I overflowed the 2^25 cluster limit...
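For a feel of the numbers: 2^25 is 33,554,432 clusters, so 1042 instances hit the limit once each instance has a bit over 32,200 clusters. A sketch of a guard that would catch this early; the per-instance count in the test is hypothetical, since the actual mesh isn't given here.

```rust
/// Scene-wide cluster limit mentioned in the chat.
const MAX_CLUSTERS: u64 = 1 << 25;

/// Total clusters across all instances (pre-LOD, pre-culling),
/// or None if the total would exceed the limit.
fn total_clusters(instance_count: u64, clusters_per_instance: u64) -> Option<u64> {
    instance_count
        .checked_mul(clusters_per_instance)
        .filter(|&total| total <= MAX_CLUSTERS)
}
```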
That was so awful to debug
is that cause you have all cluster instances always present
?
cause you should lod away most of them right?
or is that 2^25 after culling 
Yeah :/
It's pre-lod/culling
Ideally this becomes part of the culling/lod pass and we only write out data for the meshlets that we intend to raster...
I need hierarchical culling first. Also culling is the bottleneck atm, so for that reason too 🙂
do you take contributions?
Sure! I would love help, there's so much to do 😅 . I have a whole github issue on things I need to improve. You're welcome to take up anything. Just talk to me first, so that we're on the same page.
hierarchical culling is great
I spend like a constant 0.4 ms on culling iirc
completely unoptimized
What kind of culling? Multiple dispatches? And what are your inputs/outputs?
What are your inputs/outputs between dispatches though? I'm curious how it's set up.
E.g. rn I have:
- Fill cluster buffers: Input list of instances, write out clusters (instance + meshlet IDs)
- Culling/lod: Input list of clusters, write out visible clusters IDs
ah
it's pairs of instance and bvh node IDs yeah
and then instance and meshlet IDs to meshlet cull and output from meshlet cull
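A CPU-side sketch of that data flow, where each loop iteration stands in for one dispatch in the chain: the input is a queue of (instance, BVH node) pairs, and the output is the next queue plus (instance, meshlet) pairs for the meshlet cull. The `visible` flag is a placeholder for the real per-instance frustum/occlusion/LOD test, and all types are made up for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Child {
    Node(u32),
    Meshlet(u32),
}

struct BvhNode {
    /// Stand-in for the real bounds test against the camera.
    visible: bool,
    children: Vec<Child>,
}

/// Traverse the BVH level by level and return (instance, meshlet) pairs
/// that should go on to per-meshlet culling.
fn cull_hierarchy(nodes: &[BvhNode], instances: &[u32], root: u32) -> Vec<(u32, u32)> {
    // First pass input: one (instance, root node) pair per instance.
    let mut queue: Vec<(u32, u32)> = instances.iter().map(|&i| (i, root)).collect();
    let mut meshlets = Vec::new();
    // Each iteration models one dispatch sized by the previous pass's output.
    while !queue.is_empty() {
        let mut next = Vec::new();
        for &(instance, node_id) in &queue {
            let node = &nodes[node_id as usize];
            if !node.visible {
                continue; // the whole subtree is skipped
            }
            for child in &node.children {
                match *child {
                    Child::Node(n) => next.push((instance, n)),
                    Child::Meshlet(m) => meshlets.push((instance, m)),
                }
            }
        }
        queue = next;
    }
    meshlets
}
```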
Gotcha gotcha. Thanks.
Sigh, maybe it's finally time to try and fix my occlusion culling
It's gonna suck to debug though
yeah it does
the padding to 64 with granite's HZB seems to work btw
still not perfect though
I have a theory: occlusion culling is permanently broken.
Or all of us here are cursed with slightly broken occlusion culling forever.
my occlusion culling always has the issue of culling small but visible triangles
mine functions as frustum culling 
mine looks like it works but there's a very small border along the bottom and right edges that has less culling for some reason
Mine is slightly not conservative when I disable Hi-Z.
Surely, we'll encounter enough bugs to converge on something which works?
how do you build the hiz
thats really smart i didnt think of that yet
tho that doesnt scale over 4k
uhh I stole granite's HZB gen
but yeah it almost works but not perfectly
What needs padding? Input depth texture? Or output?
I'll have to reference your code
input. I think it's single dispatch hiz gen. Single dispatch hiz gen works in two passes of 64x64 downsampling per workgroup
output
damn i did 
I wonder if the issue is due to me not passing the input though
how does padding the output help
changes mip dimensions
so you have space to store the extra data generated due to NPOT
but why does that work for 64 padding
why wont it break on the higher levels
those will still have the downrounded div mip sizes
this gave me an idea tho
ok
rn i just scale the depth image to pot then downsample single pass
but it feels dirty
Idk if that's conservative
it is conservative
why wouldnt it be
in the downsampling to pot each pixel reads 2x2 pixels of the original image. Read is using uvs so it should map as long as the pot image is not less than half the size
now im paranoid
the issue lies in mapping coordinates from the screen to your scaled pot
because the mapping differs for each mip
i don't understand
the culling is using the pot image dimensions
i just scale 2560x1440 -> 2048x1024 for example with a 2x2 filter for each pixel and then downsample. The culling then uses the pot image dimensions
the culling doesnt need to use the screens dimensions. The mapping of pot image mips to original image potential mips doesnt matter after its downscaled
how do you map from screen pixel to hzb pixel?
you mean in the downsampling?
no, culling
how do you take the NDC AABB and sample from your hzb
oh you're... scaling down NPOT?
?
i just scale 2560x1440 -> 2048x1024 for example with a 2x2 filter for each pixel
this is definitely wrong
what, how?
why would i need to use screen dimensions in culling?
it doesnt matter at all what dim the culling tex has
yeah if your hzb is scaled so that the entire hzb maps to the entire screen
but how you do the mapping seems very wrong
do you have an overdraw debug view?
you seem to have a very flawed image of how i downsample
reading 2x2 doesnt mean there is an offset of 2 pixels for every out pixel
you calculate the uv in the dst image, then gather in the original image and max/min (depending on depth dir) them all
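For comparison, a CPU sketch of a footprint-based variant: each destination texel takes the max over every source texel its footprint overlaps, instead of a single 2x2 gather. It assumes larger depth means farther (so max is the conservative choice) and row-major `f32` depth. By this footprint arithmetic, going from 1440 to 1024 rows some destination texels overlap three source rows, while 2560 to 2048 columns never needs more than two. Function names are hypothetical.

```rust
/// Half-open range [start, end) of source texels overlapped by destination
/// texel `i` when `src` texels are resampled down to `dst` texels.
fn footprint(i: usize, src: usize, dst: usize) -> (usize, usize) {
    let start = i * src / dst; // floor(i * src / dst)
    let end = ((i + 1) * src + dst - 1) / dst; // ceil((i + 1) * src / dst)
    (start, end.min(src))
}

/// Downsample a depth image so that every destination texel holds the
/// farthest depth of all source texels it overlaps.
fn downsample_max(src: &[f32], src_w: usize, src_h: usize, dst_w: usize, dst_h: usize) -> Vec<f32> {
    let mut dst = vec![0.0f32; dst_w * dst_h];
    for y in 0..dst_h {
        let (y0, y1) = footprint(y, src_h, dst_h);
        for x in 0..dst_w {
            let (x0, x1) = footprint(x, src_w, dst_w);
            let mut farthest = f32::MIN;
            for sy in y0..y1 {
                for sx in x0..x1 {
                    farthest = farthest.max(src[sy * src_w + sx]);
                }
            }
            dst[y * dst_w + x] = farthest;
        }
    }
    dst
}
```

On a GPU the same reduction would need a variable-size loop or a wider fixed kernel, so it trades some cost for not depending on the scale factor.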
i have made many debug visualizations and tested many cases and the culling never breaks from what i can tell
i also had a visualization that draws the ndc for each culled object to see if it's overculling (as visible ndc would mean it culled something that's actually visible)
but its not much to show as you just see no ndc 😮
this is culling off
hmmmm
highly recommend this btw
was a massive help to fix it initially
that would probably time out for me lmao
xD
I should do some stats though 
the debug utils in tido make up like 10-15% of all its code or so. But its so nice to have that stuff
turbo bikeshed
but also sanity
debug utils are insanely helpful
what helps also is drawn meshlet count. If it changes each frame its badbad, if its consistent with no cam movement happy
like, one of the most important things when debugging
i ll polish up the debug draws for culling and show them later
yeah I have flickering shit in my overdraw debug view 
ooooh
idk why
still tryna fix that
the weird border and flickering are the two bugs left
works great otherwise
oh yeah also sw raster
just a wee bit broken
in that it instacrashes when enabled
i think it took me a few months of random insights to fully fix everything
i had a few very hard thinking mistakes
im stopping myself from doing sw raster
too much bikeshed went into the rasterization
its time for cool visuals now
usually when something is broken you don't add something else (which also turns out to be broken)

no
i build entire nanite
every component is slightly broken
then i try to debug

oh also i got the e🅱️ic internship 
time to slave writing code for money but full time now
pog
send some here to finance my stupid decisions (like buying the upcoming intel cpus)
intel pinky promises that these ones are safe
I didn't preorder or anything just to be safe
I'll wait for buildzoid to post his usual rants
buy zen 4 tho
which was already abysmal
zen 5 is just not worth it
ye
buy a 7950x3d 
zen 5 is a mess
or an 8950x3d if it has cache on both ccxs
I mean
have you seen the core to core latency
on zen 5 it's sometimes faster to read from ram than from cache (in another ccd)
ye there was a graph somewhere in #hardware that was absolutely funny
i think that was before the patch
turns out T_cache + T_fabric >= T_ram 
@fiery bolt for the BVH, what do BVH nodes equal? Some kind of grouping of clusters? Or of cluster groups? The nanite slides are vague on how the BVH is setup. Also how they enforce only 8 clusters per node.
so what I do is:
- leaf nodes are cluster groups
- for each lod, build a normal BVH8 using SAH, while also storing max parent error and merged lod bounds
- then build a BVH of the root nodes (no SAH because the AABBs are gonna be the same)
seems to work well enough
might wanna read the code tbh
and ask questions based on that
My software raster is broken and idk how to debug it 😭
can you use novideo aftermath?
yeah all it says is 'misaligned read'
and only when I touch groupshared mem
which is 
did you ask in the more public channels already?
#software-rasterization message
heh, i meant the metal guy 🙂
but these meshlets look neat and make me jealous
Join my project! I need more people 😭
i am mentally not capable yet unironically
i have successfully completely broken my error projection 
i'm also somehow crashing with an MMU fault when i shrimply index my output storage image with SV_Position.xy
how does that even happen
bro is finding bugs not even the hw knew it had
Whooh, fixed my SW rasterizer!
Had to force the HW rasterizer for near-clipped clusters, and add backface culling to the SW rasterizer
zeux massive as usual
Hello, I am collecting a list of nanite-related resources for my blog. If I've missed any of your projects, please let me know.
- https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf
- https://github.com/jglrxavpok/Carrot
- https://github.com/LVSTRI/IrisVk
- https://github.com/pettett/multires
- https://github.com/Scthe/nanite-webgpu
- https://github.com/ShawnTSH1229/SimNanite
- https://github.com/SparkyPotato/radiance
- https://github.com/zeux/meshoptimizer/blob/master/demo/nanite.cpp
This makes it an absolute joy to use on web. Deleted so much WASM ... except for METIS. 
Don't forget https://advances.realtimerendering.com/s2024/index.html#hable from Denver which credited your blog.
Oh, I suppose I could. I wanted to link virtual geometry stuff specifically, but I'll see if I can find a section to stick that in...
Probably the roadmap actually, software VRS is something I want to experiment with
I updated to meshopt 0.22 and factored normal error into the LOD selection, so much better now!
Here's a fun fact from my days as a professor's assistant
so I was correcting the practical exams of second year students for DSA
the tasks were
- Create a graph data structure, load the nodes from file and implement both DFS and BFS visits
- Write an algorithm to find the longest cycle in the graph
- Write an algorithm that determines whether there is a Hamiltonian cycle in the graph
now here's the funny part, the third task is NP complete but the professor somehow missed it 
most of the students were able to just write the bruteforce algorithm with backtracking, however our uni's computers are extremely outdated and an n factorial algorithm isn't exactly the fastest
I was asked multiple times to solve np hard problems in college. A lot of them are easy, you just gotta brute force it
funny words
our uni computers have 7900xs... but the intel kind, and they just upgraded them all from 3080s to 4070 ti supers????
like pls can we get new CPUs too
An exam last year had a question asking to come up with a linear algorithm for something that was equivalent to SAT 
(It was an intentional trick question but still lmao)
You can throw mine up there 😉
https://github.com/lukasino1214/foundation
needs a social preview image 😛 (its in the repo -> settings -> "social preview")

i replaced shrimple 1 buffer -> 1 indirect queue with a complex dequeue and that fixed my software rasterizer?


Okay Nanite general
how come Bevy uses sphere bounds for cluster frustum culling?
It already has AABBs that it uses for occlusion culling
is the perf difference that big?
AABBs are usually better than spheres
but you need a sphere for error projection bounds
I use an AABB for frustum and occ cull, and spheres for error
spheres are faster for testing frustum
im not convinced it is
show code, maybe you have some clever way to make it faster than what i did
From what ive seen so far, the culling gains from it do not justify the extra overhead for higher meshlet counts
do note that I have BVH culling, not just instance -> meshlet
cute pfp : )
@wicked notch i reimplemented prefix sum + binary search now with devshs trick
its much faster than po2 buffers
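Rough CPU sketch of the general prefix sum + binary search idea: one flat dispatch, and each thread finds which work item it belongs to by searching the prefix sum of per-item counts. This is only the generic technique; the specifics of devsh's trick are not shown in this chat, and the function names are made up.

```python
from bisect import bisect_right
from itertools import accumulate


def build_inclusive_sums(counts):
    """Inclusive prefix sum of per-item work counts (e.g. meshlets per instance)."""
    return list(accumulate(counts))


def find_work_item(inclusive_sums, thread_id):
    """Map a flat thread index to (item index, local index within that item).

    bisect_right skips items with zero work, since their inclusive sum
    equals the previous item's.
    """
    item = bisect_right(inclusive_sums, thread_id)
    start = inclusive_sums[item - 1] if item > 0 else 0
    return item, thread_id - start
```

The total thread count for the single dispatch is the last element of the prefix sum.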
damn
the overhead of many dispatches is massive on nvidia
that's op
honestly crazy to me
devsh made the point to me that the draw order is fucked with po2
should we report that though
idk it feels like a driver issue
idk their frontend always was bad
also im not sure but my new binary search is much faster than my old one
idk why i did my old one so badly
now my mesh shaders reach much better occupancy
i have the suspicion that multiple mesh shader dispatches have high overhead and cant share resources well
the ISBE memory might be contested between many dispatches
i give you the code
this will make vsms much much faster i think
nuking their overhead
they do 32x16 dispatches atm
that will go down to just 16
but this means that mesh shader with task shaders have much larger overhead for launches
than normal draws
but still way better than compute
kinda a middle child
i can kinda see that the dispatches start after each other
the later buckets are emptier
and with bucket launch the later part is just kinda empty
Hello, you all are getting early-access to my virtual geometry blog post for Bevy 0.15. Please read it and give me feedback! TODO: Compression section, perf comparison section, and images for all sections
RenderDoc does not actually show results from the GPU, it's all simulated on the CPU
pretty sure it replays commands, no?
if it was all run on the CPU it would be incredibly slow
and it also dies if i device lost in the capture
it uses transform feedback to get vs output, for example
I think the only thing it simulates on the cpu is shader debugging
yeah
and shader debugging kinda sucks as a result
because other waves and workgroup threads are shrimply not simulated
Let me remove that part then, thanks. Not sure why my output was changing every time I clicked the dispatch then.
I guess even if it was running on the GPU, renderdoc kept re-simulating it
yeah it probably replays the entire command stream when you go backwards
because making a copy of each buffer for each command would eat up a lot of mem
looks good otherwise though 
there are many things that can trigger renderdoc to replay the frame
New text:
Debugging the issue was complicated by the fact that the rewritten fill cluster buffers code is no longer deterministic. Clusters get written in different orders depending on how the scheduler schedules workgroups, and the order of the atomic writes. That meant that every time I clicked on a pass in RenderDoc to check its output, the output order would completely change as RenderDoc replayed the entire command stream up until that point.
like clicking on a different event
I don't think it's actually all for SW raster. Look at how much better culling got. It's probably 85% due to improving the DAG, and 15% due to SW raster.
15^3 stanford bunnies arranged in a cube
ah yeah culling improvements make sense
that's 'only' 236 million tris 
you should try the lucy scan
Unfortunately v0.14 can't render higher counts due to how I handle allocating some buffers
It runs OOM
So I need a test scene I can use on both
oh rip
I'm going to use the megascan cliffs and show perf for that too in 0.15
lucy is also great for testing out import perf and how good your simplifier is
because there's just so many tris
28 million iirc
lol
you'll need to multithread generation too
it takes me 18 minutes to import on an (amd) 7900x 
with all cores being hammered throughout
probably a good idea yeah
College is over?
Don't say masters 
yep
let's go
I think I had a reminder set for this day
where are you doing your masters?
How much of free time until you start masters
still deciding
till sept so plenty of time to catch up with nanitebros
Damn that's quite a lot
ye I've decided to take some time off to avoid actually dying 
Don't ever start on that in 5ish months I have school leaving exams
Computer science
Basically first and second semester of college
You get all the jazz about CPUs, how memory works, electrical engineering, programming, databases and operating systems
do it at nanite uni
what is the nanite uni
#bikeshed-😇
lmao
wherever the italian nanite dude is i guess 
Blog post for bevy 0.15 meshlet stuff is almost done
I kinda wish I scrapped the idea and had just done a blog post or two on some specific parts, instead of everything. It's kinda a big mess of a post, but too late to change now...
I think the memory compression section came out well, but not so much the rest
@fiery bolt BVH for nanite = internal nodes point to cluster groups, and use AABBs based on cluster LOD spheres, and then leaf nodes point to clusters? Or is that wrong?
Oh I found this. Why do you build a separate BVH per LOD? I'm trying to think if that accelerates common culling scenarios or something.
@wicked notch congratulazzione
leaves are groups, because they must always render together, internal nodes are just a normal SAH-optimized BVH built out of the leaves
I thought that since all LOD groups would be spatially near, if I built a BVH out of everything at once, multiple LODs would be parented by a single group, so max parent error would always be pretty high
so you're gonna be expanding a lot more nodes
Ahhhh that makes sense...
then build a BVH of the root nodes (no SAH because the AABBs are gonna be the same)
What does this mean? Literally just take 8 random nodes, group, and repeat until you have a single root node?
I.e. if you have 16 LODs, pick 2 sets of 8 randomly to group, and then group the two sets once more
yeah pretty sure that's what I do
sorting by error is probably a better idea now that I think about it lol
For grouping the LODs? It probably barely matters right, it's only a few nodes
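The "group 8, repeat until one root" step could look roughly like this. Nodes are just nested tuples here and `group_until_single_root` is a made-up name. To get the sort-by-error variant, sort the input list before calling it.

```python
def group_until_single_root(nodes, arity=8):
    """Repeatedly pack consecutive nodes into parents of up to `arity`
    children until a single root remains. No SAH, just input order."""
    nodes = list(nodes)
    while len(nodes) > 1:
        nodes = [tuple(nodes[i:i + arity])
                 for i in range(0, len(nodes), arity)]
    return nodes[0]
```

With 16 per-LOD roots this gives two parents of 8 children each, then one root over those two.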
I made this diagram to explain how meshopt's LOCK_BORDERS flag works (#2)
And then I realized I have no idea how it works
So guess I'm not using it and just gonna skip explaining it lol
No idea how it preserves the meshlet borders if it's just going off of the topological border
Hello, I am once again asking for (this time final) feedback on my meshlet blogpost: https://github.com/JMS55/jms55.github.io/blob/ef1d060e11daf89e9ff68f4fdf3bd80f6b0653f2/content/posts/2024_11_14_virtual_geometry_bevy_0_15/index.md
I have 0 motivation to do BVH culling after I spent so much time on virtual geometry for bevy 0.15 😬
Guess I need to take a break
@fiery bolt for your BVH, for interior nodes, what bounding sphere do you use to project the error?
Your leaf nodes are cluster groups, with error = parent group error, and bounding sphere = parent group bounding sphere
And when building an interior node over those leaf nodes, you set error = max error of leaf nodes
But what do you set the bounding sphere to be?
A new bounding sphere enclosing all the leaf node bounding spheres?
(btw it's confusing because you use the DAG parent group LOD data, which is different than the BVH parent lol)
yup
it's just all the child BVH nodes' lod spheres merged
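Sketch of what that merge could look like: interior node error is the max over the children, and the LOD sphere is built by folding the children's spheres together pairwise. Spheres are `(x, y, z, radius)` tuples and the function names are made up. Folding pairwise is an assumption for the sketch; it gives an enclosing sphere but not necessarily the smallest one.

```python
import math


def merge_spheres(a, b):
    """Smallest sphere enclosing two spheres given as (x, y, z, radius)."""
    ax, ay, az, ar = a
    bx, by, bz, br = b
    dx, dy, dz = bx - ax, by - ay, bz - az
    d = math.sqrt(dx * dx + dy * dy + dz * dz)
    if d + br <= ar:   # b is already inside a
        return a
    if d + ar <= br:   # a is already inside b
        return b
    r = (d + ar + br) * 0.5
    t = (r - ar) / d   # how far to slide the center from a towards b
    return (ax + dx * t, ay + dy * t, az + dz * t, r)


def build_interior_node(children):
    """children: list of (error, lod_sphere). Returns (max error, merged sphere)."""
    error = max(e for e, _ in children)
    sphere = children[0][1]
    for _, s in children[1:]:
        sphere = merge_spheres(sphere, s)
    return error, sphere
```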
Thanks! This shouldn't be too bad to implement then. Just very confusing, because there's both DAG parents and BVH parents 😅
struct BvhNode {
child_start_id: u32, // If meshlet, is meshlet ID, else is pointer to BvhNode
child_count: u16, // If u16::MAX, then node is a single meshlet, else is BvhNode child count
error: f16, // If meshlet then is group_error, else if is lod group then is parent_group_error, else is max of children's parent_group_errors
bounding_sphere: vec4<f32>, // If meshlet then is group_bounding_sphere, else if is lod group then is parent_group_bounding_sphere, else is new bounding sphere enclosing all children bounding spheres
}
😅
you should have groups as leaf children, not singular meshlets
what I do is SOAify my BVH into a BVH8, so a u8::max is a single node child, otherwise it's meshlet count
('single node' => actually holds data for 8 nodes)
this also reduces queue memory size by 8x
probably the main reason I did it tbh
Right, but I plan to reuse the same type for the meshlets within each leaf
Each meshlet needs a bounding sphere and error anyways
but also a tight cull sphere and other metadata
Because after the lod group check against parent error, you need to check the meshlet's self error
Yeah I have that separate
Mhm
Yeah so this confused me as well. But dag traversal is a lot more expensive than rearranging it into a bvh with a root per lod level, is why I think it's done.
hm
DAGs aren't really trees
they reconverge
so DAG traversal is more complex, as you have to ensure you don't revisit things
since we can already do LOD selection in parallel, we don't really have to follow the DAG, we can use any structure to accelerate it
thus, the BVH
@fiery bolt are you doing frustum + occlusion culling in the same kernel as the LOD traversal, or do you do hierarchical LOD traversal, write all the meshlets to a buffer, and then do culling?
I frustum and occlusion cull BVH nodes too, and also have a separate meshlet cull stage
im confused
you have to visit the dag anyway as you instantiate meshlets, no?
what is the dag for at all if its not used?
the DAG exists during build
but you don't have to visit it, since LOD decision can be localized if you store current and parent error
so now you can just build a BVH out of parent error to accelerate stuff
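The localized LOD decision, as a tiny sketch: each cluster only needs its own error and its parent group's error, so every cluster can be tested independently. This assumes the errors are already projected to screen space and grow monotonically up the chain; `is_selected` is an illustrative name.

```python
import math


def is_selected(self_error, parent_error, threshold):
    """Render this cluster iff it is good enough but its parent is not."""
    return self_error <= threshold < parent_error


# One chain of LODs for the same surface: (self error, parent error).
# The coarsest level has no parent, so its parent error is infinite.
chain = [(0.0, 0.1), (0.1, 0.5), (0.5, math.inf)]
picked = [is_selected(s, p, threshold=0.3) for s, p in chain]
```

Because the error intervals tile without overlap, exactly one level of the chain passes for any threshold, with no need to walk the DAG.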
yeah, from the BVH
i dont get it
its not a bvh tho its a dag
like what does the dag do then
at build time
the DAG isn't really important tbh
it doesn't really 'exist' at build time either
there's no data structure explicitly storing it
i see
it's just implicitly there with how groups relate to each other
but since you only care about the current group and parent group
you can BVH-ify it
ok wait
Heck? Do you have a separate tight culling sphere for every BVH node?
AABB
but yeah
public struct BvhNode {
public Aabb aabbs[8];
public f32x4 lod_bounds[8];
public f32 parent_errors[8];
public u32 child_offsets[8];
public u8 child_counts[8];
}
@fiery bolt when a child finds out its parent is instantiated it kills itself?
Why do they store 8 values at a time per node?
SOA
but how do you do the lod cut then
so i can have one index in the queue for 8 nodes
I see
nanite also does a BVH8 iirc
So you basically just build a BVH, and then inline the 8 children into each node?
and then the children start/end become what?
but the parents share children with a dag
it's a single index for BVH nodes, with count == 255
count only matters for meshlets
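A CPU sketch of how traversal over that SoA layout might go, with one queue entry per wide node. Field names follow the struct above. The assumptions: the AABB frustum/occlusion tests and the `lod_bounds` projection are left out, and errors are treated as already projected. A count of 255 means "this slot points to another BVH node".

```python
from dataclasses import dataclass
from typing import List, Tuple

IS_NODE = 255  # sentinel count: child_offset is a BVH node index


@dataclass
class BvhNode8:
    parent_errors: List[float]  # up to 8 slots per node
    child_offsets: List[int]
    child_counts: List[int]


def traverse(nodes: List[BvhNode8], root: int, threshold: float) -> List[Tuple[int, int]]:
    """Return (first meshlet, meshlet count) ranges that survive the LOD test."""
    queue = [root]
    meshlet_ranges = []
    while queue:
        node = nodes[queue.pop()]
        for err, off, cnt in zip(node.parent_errors,
                                 node.child_offsets,
                                 node.child_counts):
            if err <= threshold:
                # Every parent below this slot is already good enough, so
                # something coarser gets drawn instead: prune the subtree.
                continue
            if cnt == IS_NODE:
                queue.append(off)  # one queue entry covers 8 child slots
            else:
                meshlet_ranges.append((off, cnt))
    return meshlet_ranges
```

Each surviving meshlet range would still go through the per-meshlet self error check and culling afterwards.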
ok so, firstly, do you understand how it works without a BVH?
if you just expand to all meshlets
no