#Iris - A Journey through OpenGL and beyond to learn Graphics
ok holy fuck it's rendering 1.2 million meshlets that doesn't sound good 
same scene in unreal takes 1.92 ms
yeah
@wicked notch what heuristic did you use for software raster vs hardware raster again? I'm having a hard time figuring out a good metric. Nanite briefly mentions that they compute some kind of longest triangle edge data per cluster and use that iirc.
project that to screenspace and software raster if it's less than 16 pixels or something iirc?
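A minimal sketch of the heuristic floated above: take the longest triangle edge stored per cluster, project it to screen space, and pick software raster when it is under a pixel threshold. The function names, the pinhole projection, and the 16 px default are illustrative assumptions taken from the "iirc" guess in the chat, not Nanite's actual code.

```python
import math

def projected_edge_pixels(edge_length, distance, fov_y_radians, screen_height_px):
    """Approximate on-screen length (in pixels) of an edge facing the camera."""
    # Pixels covered by one world unit at this distance for a perspective camera.
    pixels_per_unit = screen_height_px / (2.0 * distance * math.tan(fov_y_radians / 2.0))
    return edge_length * pixels_per_unit

def use_software_raster(longest_edge, distance, fov_y_radians, screen_height_px,
                        threshold_px=16.0):
    """True if the cluster's longest edge projects to fewer than threshold_px pixels."""
    if distance <= 0.0:
        # Cluster at or behind the camera plane: the projection is unbounded,
        # so fall back to hardware raster (an assumption of this sketch).
        return False
    size = projected_edge_pixels(longest_edge, distance, fov_y_radians, screen_height_px)
    return size < threshold_px
```

The threshold is the tuning knob the conversation is struggling with; the sketch only shows where it plugs in.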
For now I've just given up on tweaking it, left it as something silly, and have moved on to other bits
Apparently Brian Karis read my blog post though, awesome!
if we never hear from you again, UE has won
He said he enjoyed it
unlike that tencent talk
Jasmine let us know if Epic Games sends their hit agents against you
I saw both of Tencent's talks between Advances and Moving Mobile Graphics but couldn't understand most of it.
Wish I got a picture, but Jasmine's article was linked in Advances. Slides are up, let me link it.
ok so I just read the tencent slides and... it's just nanite but worse?
it's minimum effort nanite reimpl
instance cull, parallelize all meshlets, no software raster, and some wacky lod curve
Could the lod curve somehow compensate for lack of software raster? They were targeting mid-high end mobile.
if you don't target pixel sized triangles then yeah, sw raster becomes unnecessary
buuut the benefits of sw raster are elsewhere too
for example with VSM, sw raster is pretty much the greatest source of speedup
lies, culling is
culling is the second greatest source of speedup 
nah saky's right it's culling first and then sw rast
but the advantage of not going through the regular pipeline for triangles is massive
I was this close to getting into another insane argument
especially for small ones
no more
wink wink bistro trees
bistro trees 
I genuinely believe that sw rast for bistro trees would fix everything
but alas
I still don't have nanite shippable 
same
Hey deccer, just jumping in... what's going on?
I'm doing hardware raster as well. Software raster doesn't pay off for anything larger than a pixel sized triangle. In the UE source code, it's literally called MicroPoly raster (...for a reason). There's overshading to compute hardware gradients. Even Nanite bins and rasterizes wider triangles via HW.
your name was part of references on some siggraph presentation, two or three message up from when i booped you
Holy shit
I didn't even realize that. People are referencing my reddit posts. Wow.
Granted, their limits are a bit fuzzy... but beyond a certain size the hardware is faster.
What kind of HW raster? Mesh shaders or?
I still haven't figured out a good heuristic for SW vs HW. NSight shows like +/- 0.5ms all the time for the raster pass, making measuring perf impossible. Idk how to improve when the results are so variable.
Make the test scene bigger
Nah my target minspec is 1050Ti. Just regular ole'. Though I do frustum/HiZ culling compute and write out MultiDrawIndirectCount params into a buf.
There are some great high poly megascans of giant buildings on Turbosquid... probably have to dish out a few bucks, but worth it. Take those and just replicate over and over side by side.
I tried with like 10 copies of the quixel megascan icelandic cliffs. Maybe that's not enough??? idk
more.meme.gif
1 draw indirect args per cluster? How do you not get super bottlenecked on the command processor? I had tried this before, it was fairly slow
Oh nice, thanks for the tip
Not clustered, complete instances.
But total number of issued calls is once per material.
But you have clusters(?)
whole-ass instances of geometry. Not limited to ~128 triangles.
I'm not doing Nanite.
I'm doing visibility buffer rendering only
Ohhh ok
how many tris in one of them
my stress test scene has like 100 billion tris total 
spends about 10 ms in raster (hw, no sw)
well, i impld sw inside my mesh shader and that bumped it up to 20 ms lol
Not home rn, will check later
@wicked notch I sped up mesh generation significantly by generating meshlet AABBs correctly instead of going through every single vertex in the mesh 
crazy how that works 
it even speeds up rendering because now meshlets are being culled!
is your AABB structure something you wrote yourself?
yeah
i wrote some goofy bvh too
it... works?
the nanite presentation glossed over how it works so idk how correct i am
@fiery bolt I forget who is who, did you do the WebGPU Nanite impl?
(Scthe on github)
@fiery bolt is your code open source anywhere? I'd like to check out your meshlet DAG building/simplification code
i think i'm using meshopt for simplification currently, not my own thing
changed to meshopt to debug something then forgor to switch back 
Looks... very similar to my own code
Did you base it off of mine? (which is totally fine, just curious what your process was)
i did yeah
Nice, happy it helped

If you end up learning anything, let me know please!
So far today's experiments have revealed:
- Setting target error = 1.0 helps a lot. No reason to limit the target error.
- 255 v / 128 t is better than 64/64 (meshopt won't let me do 256 vertices :P)
yeah i did the first one
and nanite does the latter
i didn't do the latter exactly because meshopt doesn't like 256 lol
i do 128/128
well, 128/124 because mesh shaders
I got better triangle fill rate with 255/128 vs 128/128
what's your test model?
Stanford dragon, currently
that is shade flat by default so you have zero vertex reuse
i had that issue with all the three stanford models
I have:
- Stanford dragon
- Stanford bunny
- Jinx from arcane from sketchfab merged down to 1 mesh (I don't really support multiple materials per meshlet mesh)
- Icelandic cliffs from quixel megascans
wdym?
every tri has unique vertices
the bunny too
oh yeah i also use the max of error instead of adding the max child
(and lucy)
I was doing that originally, but then your error is not cumulative between LODs
So if LOD 0 has 0 error, LOD 1 has 10 error, and LOD 2 has 20 error
LOD 2's total error should be 30 relative to the base mesh (LOD 0)
not really
because it's a world-space displacement
higher LODs will naturally have a higher error
Why? And that's tangential, no?
lemme try to ms paint something
You don't want to know error relative to the previous LOD, you want to know error relative to the original mesh right?
actually no idk how to draw it lmao
basically, when you simplify, you're collapsing edges into vertices
and the simplification error is the maximum displacement a vertex moved (sort of)
as you simplify more, edges get longer
so when you collapse, you get a higher error naturally
Sure, and that's a consistent metric for DAG cut purposes, yes
i think this would lead to double-counting if you add
making your error higher than it should be
But that means your error projection is no longer saying "is this LOD imperceptible from the base mesh at this distance"
the nanite presentation says they max, not add iirc
so in this case, if you're collapsing the red edge
you get this as the output
then you collapse this
leading to this
Comparison of 255/128 vs 128/128 on icelandic cliffs btw: https://paste.rs/AFqkf.txt. Meshlet occupancy is a map of triangles_per_meshlet:count_of_meshlets key:value
the total displacement in the two simplification steps is equal to half the length
which is equivalent to the second collapse error
can you rephrase that last bit?
I understand the images, but trying to understand the implications still
so if you compare the first and last image, the vertex in the center has moved half the total length, right?
that's your final error of the simplification
yeah, that's equivalent to the displacement done by the second collapse
Also I reread the nanite slides rq
They say they do the max of child's parent error for the BVH
They don't mention how they handle choosing the parent error in the first place though
?
it's a leq comparison
so it doesn't matter
if parent >= threshold && current <= threshold
So if that's true, then it's this?:
- LOD 0 clusters have error = 0.0, parent_error = INFINITY
- Group and simplify LOD 0 to form a group
- Group error = max(group_error_from_simplify, all_child_errors)
- LOD 0 parent errors = group_error
- LOD 1 error (not parent) = group_error
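The steps above can be sketched roughly as follows. `Cluster`, `finish_group`, and `should_render` are hypothetical names for illustration, and the strict `>` on the parent side is an assumption made so that exactly one LOD passes for a given threshold (the chat writes the test with `>=`).

```python
import math
from dataclasses import dataclass

@dataclass
class Cluster:
    error: float = 0.0               # LOD 0 clusters start with zero error
    parent_error: float = math.inf   # nothing replaces this cluster yet

def finish_group(children, simplify_error):
    """Apply the listed rules to one simplified group and return the group error."""
    # Taking the max (not the sum) with the child errors keeps error monotonic.
    group_error = max([simplify_error] + [c.error for c in children])
    for child in children:
        child.parent_error = group_error
    # The clusters split out of the simplified group get error = group_error.
    return group_error

def should_render(cluster, threshold):
    """Cut test: own error is acceptable, but the coarser parent's is not."""
    return cluster.error <= threshold and cluster.parent_error > threshold
```

Because the group error never decreases up the DAG, at most one level along any parent chain satisfies `should_render` for a given threshold.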
yeah

i do that yep
why does megascans not have a way to sort by tri count smh
it is indeed add
they add child errors to the parent error?
they add the bounds but yes same thing in UE's case
the bounds of the group directly affect the error calculation
which is shrimply copied
me no understand
UE formatting

Where the heck does LODError come from?
i assume that's set to 0
blender -> subdivision modifier -> simple
be careful because you can turn those 2 million triangles into 2 quintillion
speaking from experience 
also for runtime error calculation, you should store the group bounding sphere and place your error test sphere on that, instead of at the center of the group
@primal shadow ^
but that's booooring
i'll just use lucy then
Wdym?
currently you project a sphere with center = group center and radius = error, right?
Yeah
that undercounts error for tris closer than the center
?
you have triangles closer to the camera than the group center right
so if you test the error at the group center, those triangles might have an error higher than what you calculate
page 11
Wait so what's the solution?
Use the meshlet's center, and don't have any center for the group?
store the entire bounding sphere of the group
and place the error test sphere at the closest point to the camera
yeah
Calculate closest point on the sphere to the camera
Then place the sphere at center = closest point, radius = error
yep!
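A small sketch of that placement, assuming plain 3-tuples for points. The helper names are made up here, and snapping to the camera position when the camera is inside the bounding sphere is only one possible answer to a case the chat leaves open.

```python
import math

def closest_point_on_sphere(center, radius, camera):
    """Point of the sphere (center, radius) nearest to the camera position."""
    to_center = [c - p for c, p in zip(center, camera)]
    dist = math.sqrt(sum(x * x for x in to_center))
    if dist <= radius:
        # Camera inside the bounding sphere: assumption, snap to the camera.
        return tuple(camera)
    scale = (dist - radius) / dist
    return tuple(p + x * scale for p, x in zip(camera, to_center))

def error_test_sphere(group_center, group_radius, error, camera):
    """Sphere to project for the LOD test: (center, radius)."""
    return closest_point_on_sphere(group_center, group_radius, camera), error
```

Testing at the nearest point overestimates the projected error for the rest of the group, which is the conservative direction.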
the only issue is that idk what to do when the camera is inside the sphere
bounding or error sphere
because then your sqrt(d2 - r2) gets a negative value in it
Bleh, that's going to be a much more involved change, I'll add this to the TODO list
if you clamp the value inside sqrt to 0, you get infinite projected error
which is sus
i'm not sure
I mean yeah we want the LOD calculation to be as accurate as possible, but I feel like it's such a handwave to begin with...
idk what the source of my overculling is 
it's probably occlusion culling
but it might be this
My occlusion culling is just broken
Using SPD to build the depth pyramid is not correct unfortunately
yeah
Say your depth texture is a non-power-of-2
What size do you make mip 0 of the depth pyramid?
So 1800: 1800/2 = 900, rounded down to 512?
So you don't enforce that the new depth pyramid is a power of 2? Hmm
nope
I think that may be wrong, 1s
(depth_view.get_view_width() + 63u) & ~63u,
(depth_view.get_view_height() + 63u) & ~63u,
I think that rounds up to the nearest multiple of 64?
yeah
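For reference, the bit trick in those two lines rounds up to the next multiple of 64. A quick sketch of the same arithmetic (the function name is made up here):

```python
def round_up_to_multiple(value, multiple):
    """Round value up to the next multiple of a power-of-two `multiple`."""
    assert multiple > 0 and multiple & (multiple - 1) == 0, "power of two only"
    return (value + multiple - 1) & ~(multiple - 1)

# Same as (x + 63u) & ~63u in the snippet above.
padded_width = round_up_to_multiple(1800, 64)
```

Values that are already a multiple of 64 are left unchanged.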
Are you doing that? That may be your issue if not
yeah i'm not
but why is he doing that
is it because each workgroup does a 64x64 block
and that he doesn't bound check in the shader 
Probably? Idk
Not sure if it's bounds checks though
Could be correctness
Like, extra bounds checks probably matters a lot less than extra vram usage from this
i'll add that and check
Hopefully you can figure it out, I don't really want to go stare at hiz code
i'll let you know
Wait before you start I have another question
When simplifying meshlet groups, which edges do you need to lock?
ones that are only a part of one tri
elaborate?
Which border though?
of the group
The border of the group, or border between meshlets?
you don't want to lock between meshlets no
only the group
that's the secret sauce to nanite
oh yeah i also use larger groups than 4
i think it helps simplification perf
What do you use?
Assertion failed: index_count / 3 <= kMeshletMaxTriangles, file vendor/src/clusterizer.cpp, line 717
The groups are too big for meshopt to split
Oh btw, are you using METIS or meshopt for clusterizing?
Interested to know if you compared the two at all
meshopt
might try metis
how are you splitting?
build_meshlets(indices, vertices, 255, 128, 0.0)
in each group?
i do the same
it shouldn't be causing issues
meshopt clusterizes the source mesh after all
and it does lucy (28 mil tris) in 2-3 min
oh i also made this shitty python script that automatically tiles any gltf
Hah, you think I can load gltfs
shouldn't bevy be doing that for you 
Yes but not for meshlet meshes
Our asset processing APIs are sadly fairly poor atm, so I don't have a good way to convert GLTF -> bunch of meshlet meshes + scene file
(yet)
ah

my issue is that my engine is not an engine
the only thing it can do is load gltfs 
Join Bevy
does bevy do vulkan 
No, wgpu.
There is an alternative vulkan backend in a non-official crate, but it doesn't work with any existing stuff ofc.
Thinking this is the best description of my own effort TBH.
Now I just need an excuse to skip on my externally broken occlusion culling.
I swear it was broken when I found it. No way to put it back together.
Based on recent discussion, bumped to 255v/128t, and some other changes https://github.com/bevyengine/bevy/pull/15023
that code looks so fragile because you aren't using any constants for these values
Code quality comes later :p
it's something i'd put under "code correctness" which would be the first thing to work on
There's too much else to work on in the mean time, small stuff like this is not a priority.
did some optimization and now it culls 1 million stanford dragons in 2 ms 
raster takes 23 ms though (only hw, no sw) 
800 billion tris at 30 fps
But can it support foliage
i have achieved scene independence
7.2 trillion triangles at 60 fps
@primal shadow the edge detection really does help simplification massively
thanks for the insight
not really no
the other stanford dragon with 7.2 million tris imported in about 1.5 min
¯\_(ツ)_/¯
Well, I'll take a closer look at your code soon
I need a break from DAG building
good question lemme check
Also do you have a link to your github? I lost it
it is... whatever this is
uhh
I mean it's kinda dragon shaped
sure
Managed not to turn into a sphere
probably because meshopt doesn't generate vertices
You went back to your own simplifier?
Also hey, at some point I'd appreciate an explanation on how to implement subpixel SW raster
The stuff I found online never made sense to me
And it seems like you implemented it
nah i haven't yet
but i will soon
i'll let you know when i do
nah i'm using meshopt
Oh nvm you're saying it didn't turn into a sphere because of this
yeah
Ah gotcha, nvm yeah looking at your sw shader I see it was something else. Thanks!
my sw rasterizer is actually really bad
doing everything in hardware is very slightly faster lmao
Really? I saw way faster speeds with SW raster
Did you have mesh shaders already though?
yeah it's all mesh shader based
and i do backface culling in the mesh shader
That explains it. Nanite was started before mesh shaders, and mesh shaders give a lot of similar speedup.
i think with a bit of optimization and async compute overlap i can eke out a fair bit of perf with sw tbh
mainly the async compute overlap
the mesh shaders are completely bottlenecked on hw so they have terrible util
Yeah I don't have mesh shaders or async overlap
impl mesh shaders into wgpu + naga and use them on native 
naga isn't terrible
i rewrote/restructured it for module-level scoping long ago
wasn't too bad
no idea how the code is today though
insane
Question about Nanite/what you guys do
Meshlets are merged as the LOD gets lower to prevent edge cruft
Doesn't this rely on the LOD of both meshlets decreasing? So what would happen if you need to lower one meshlet's LOD, but another meshlet must stay higher?
LOD isn't decided at meshlet level, but at meshlet group level
and you only lock the boundary of the group
and after simplification, you split the group into meshlets again
Lock the boundary of the group, *except for where it intersects the mesh border
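A rough sketch of that locking rule, assuming triangles are given as index triples into a geometry-only (position-deduplicated) index buffer. The helper names are illustrative, not meshopt's or Bevy's API.

```python
from collections import Counter

def edge_counts(triangles):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts

def locked_vertices(group_triangles, mesh_triangles):
    """Vertices to lock when simplifying one group of meshlets."""
    mesh_counts = edge_counts(mesh_triangles)
    locked = set()
    for edge, uses in edge_counts(group_triangles).items():
        on_group_border = uses == 1              # only one triangle of the group uses it
        on_mesh_border = mesh_counts[edge] == 1  # only one triangle of the whole mesh uses it
        if on_group_border and not on_mesh_border:
            locked.update(edge)
    return locked
```

Edges between meshlets inside the group are shared by two group triangles, so they stay free to collapse.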
compute analytic derivatives instead of using ddx/ddy perhaps
@primal shadow oh yeah i also found out that my border vertex detection was completely wrong, fixed it and pushed to my branch (PRed to your bevy repo)
might wanna test it out again lol
I am, my ddx/ddy aren't from the ddx/ddy() functions, I compute them as part of the visbuffer resolve
Oh, thank you, I appreciate it. I am extremely short on time this week, but please do remind me if I don't get to it by Saturday.
will do if i remember
lmao i fixed edge classification which led to fixing cracks which led to significant occlusion culling improvements which now means i can do 7.2 trillion triangles in under 3 ms

4.3 +- 0.3 if i lock clocks to base
!remindme 3d remind jasmine about the thing
Alright deccer, I'll remind you in 3 days about:
remind jasmine about the thing
mobile drivers coming in clutch (words you'd never thought you'd ever see)
do you mean crutches?
@languid vector methinks i have found a solution
the original sphere projection algo
that takes near clipping into account
project the full bounds to the screen, and scale the error using that
thanks!
idk if it works yet
what was wrong?
let me guess - you classified edges without generating shadow index buffer with geometry-only data?
the only thing I wonder is how the hell you manage to do 7.2 trillion triangles in 4ms
Is hierarchy this major optimization?
nah I was unlocking edges that were mesh borders but not group borders, but did it completely wrong 
so it was unlocking group borders
leading to cracks
yeah it's really useful
you can get rid of a bunch of instances early
and never even consider more than 1 or 2 meshlets for them
you also do frustum and occlusion culling inside hierarchy traversal
so you can save a bunch of bandwidth too
atleast, I do because it's literally just a conventional bvh with error stuck on top
I've only done the frustum culling so far
I can't really understand what I am bound by since not a single metric in Nsight is loaded more than 80%
memory
though "warp can't launch" is insanely high
or raster
which shader type
solved with sw raster?
task/mesh pipeline
what's ISBE alloc stalled at
I am not sure I understand what ISBE is

I think it's memory allocation for mesh shader outputs
if your rasterizer is cooked you'll stall at that because it's still processing other tris
I also do nanite-style quantization so it should have saved a bit of bandwidth
I mean, it is nearly x4 compression
I have zero compression lol
it's not about mesh shader read bandwidth, it's about how fast the rasterizer can chug triangles
you're raster bound
I have never thought I will be raster bound 
and I guess the only way to solve it is occlusion culling + sw raster?
afaik occlusion culling is a massive optimization
really massive
yes
how large is your test scene
25x25 grid of happy buddha meshes :>
how many tris is that
you should use the xyzrgb dragon 
it's got 7.2 million tris in it
or Lucy 
1.1m
27 mil iirc
yeah lucy is fat ass mesh
which gpu
1660 Ti
yeah you need occlusion cull
I guess so 
it takes 17 min to build for me lmao
holy cow

still better than unreal
unreal takes longer?
I let it run for an hour once and it froze
so I killed it
if this shit works I can finally start streaming 
"this shit" is a new sphere projection method?
yeah
I was just about to ask you if you already have implementation 
my dumb ass can't handle the maths now
I'm making coffee so I can consume copious amounts of caffeine to understand wtf the paper does
I personally can't understand the meaning of "conservativeness" in all these papers
every choice you make must overestimate error, not underestimate
Ahh, it makes sense now
your bounding box must be larger than the object, never smaller
because otherwise you will cull too much
yess, it makes sense now. thanks!
alr, it has 3 out variables:
out vec2 perpendicularDirection, out vec2 U, out vec2 L
now need to figure out what tf to do with it
uhhhhhhhhhhh error is decreasing as i get closer
so it aint working
i've done something wrong
so as I understand, it takes view and sphere and outputs a polygon with N sides with coordinates in screen-space? I guess I only need AABB from the sphere
the core algorithm takes a sphere in view space and tells you the min, max along a specific axis
so I simply build an AABB and then compute its area that I use as error estimation?
"simply build" using this algorithm I mean
i'm doing y axis so it's always pi / 2
yup, makes sense
does it output size in screen-space?
or I need to project it further
I see
I guess you are better at maths than me, so if you don't mind, can I come back with questions if I have some after reading the paper?
thanks!
same
no, it indeed returns nans and I can't understand why
@fiery bolt so I figured out the issue. I have inverse Z camera and was passing inverted clip planes to the algorithm
tho haven't figured out how to use U and L for projection yet
I mean, near plane was further than far plane
which produced a ton of NaNs due to sqrt of a negative value
ah lmao
did u figure out how to use U and L?
that's the screenspace bounds
projected to the far plane?
yeah you need to project
how do I project projected stuff 
Alr I see, I will have a look at sample code
so yeah, it works relatively awful without reprojection
i think i have something that sorta works?
how did u fix the issue when error gets smaller if you get closer to the sphere?
vec2 parent_U, parent_L;
GetBoundsForPhiLengyel(0.0f, parent_projected_bounding_sphere.center, parent_projected_bounding_sphere.radius, camera_data.near_clip_distance, camera_data.far_clip_distance, parent_U, parent_L);
vec4 parent_projected_points[2];
parent_projected_points[0] = camera_data.proj * vec4(parent_U.x, 0.0f, camera_data.near_clip_distance, 1.0f);
parent_projected_points[1] = camera_data.proj * vec4(parent_L.x, 0.0f, camera_data.near_clip_distance, 1.0f);
const float parent_result_error = (parent_projected_points[1].x / parent_projected_points[1].w) - (parent_projected_points[0].x / parent_projected_points[0].w);
this is how I do it
public f32 project_error(f32x4 bounds, f32 error) {
    let center = mul(this.mv, f32x4(bounds.xyz, 1.f)).xyz;
    let radius = bounds.w * this.scale;
    let err_frac = error / bounds.w;
    if ((center.z + radius) <= this.near) return 0.f;
    let dist2 = dot(center, center);
    let a = sqrt(dist2 - center.z * center.z);
    let t2 = dist2 - radius * radius;
    let t = sqrt(max(t2, 0.f));
    let in_sphere = t2 <= 0.f;
    f32x2 bounds[2];
    // cos(theta) = t / dist
    // sin(theta) = r / dist
    // T = (rotate(theta) * (a, z) / dist) * t,
    // removing the dist divide in cos, sin
    // ncos(theta) = t
    // nsin(theta) = r
    // rotate(theta) == rotate(ntheta) / dist
    // therefore, T = (rotate(ntheta) * (a, z) / dist2) * t
    // saving us two divides and a sqrt!
    var v = in_sphere ? f32x2(0.f) : f32x2(t, radius);
    let clip_sphere = (center.z + radius) >= this.near;
    let off = this.near - center.z;
    var k = sqrt(radius * radius - off * off);
    [unroll]
    for (int i = 0; i < 2; i++) {
        if (!in_sphere)
            bounds[i] = mul(f32x2x2(v.x, v.y, -v.y, v.x), f32x2(a, center.z)) * v.x / dist2;
        let clip_bound = in_sphere || (bounds[i].y < this.near);
        if (clip_sphere && clip_bound)
            bounds[i] = f32x2(a + k, this.near);
        v.y = -v.y;
        k = -k;
    }
    let ndc_size = abs(bounds[0].x / bounds[0].y - bounds[1].x / bounds[1].y) * this.h;
    // NDC size has a range of [0, 2] mapping to [0, height],
    // but don't divide by 2 because the error is divided by 2 at build-time.
    return ndc_size * err_frac * this.screen.y;
}
slang
got it
Alr, I will get back to the code tomorrow, gotta sleep. it is 3 AM in my country 
you still have 3 hours till bedtime!
make that 6
Time to debug tangents again
Btw after further thought, projecting the LOD sphere based on the culling sphere makes no sense to me
why not?
you'd have different projections for different parts of the same group, which makes no sense
how would you?
and when you do a BVH, it's based on the group, not individual meshlets
you're using the group LOD sphere
wait what culling sphere do you use?
a merged sphere of all lower lods' group lod sphere
to do BVH you need a bounding sphere or else traversal won't be monotonic
@fiery bolt so we have an answer for "how to project the sphere" question, but do we have an answer for "where tf to place the sphere" question? 
I mean if camera is inside the sphere
Just snap to camera position?
I still don't understand why the sphere should not just be at the center of the group
So test is conservative
You should never underestimate error, you can only overestimate
here are 2 groups that have the same error. Obviously the left one will have perceptually bigger error because it is bigger by itself, so it will be closer to camera
I ended up doing something different that seems to work
but it's rendering way too much
like, 2 million meshlets
my culling queues are filling up 
ideally u want 128 times less for 1080p monitor 
The test is made so that the error is not (in theory) perceptible, wouldn't making it more conservative just use more VRAM when you could use a lower fidelity lod for the same visual result?
the test only guarantees the error isn't perceptible when you use bounds
I understand that it keeps higher fidelity lods more, but I'm not sure it's really necessary
if you just use the center of the group it can't guarantee that
I may be missing the point completely but it bothers me that I don't see the issue ^^'
it is not required but it is kinda more valid
- assuming error at the center is 0.999 px, the error for all triangles in the group closer than the center (which should be about half) will be more than a pixel
- when you do BVH, you need the outermost node's projected error to always bound that of its children, so if you just use the center of the node, groups closer than that (again, about half), might have a higher projected error than the BVH node itself

btw, do u clamp the error sphere position to camera position if camera is inside the group sphere?
i was sending my code but discord seems to be blocking it 
i assume it's being blocked for spam
// 2D Polyhedral Bounds of a Clipped, Perspective-Projected 3D Sphere (Michael Mara, Morgan McGuire).
// We get the projected bounds on the axis that is the longest upon projection (need to be conservative!),
// which is the one from (0, 0) to the sphere's center.
public f32 perceptible_error_distance(f32x4 bounds) {
    let center = mul(this.mv, f32x4(bounds.xyz, 1.f)).xyz;
    let radius = bounds.w * this.scale;
    if (center.z + radius <= this.near)
        return 0.f;
    let dist2 = dot(center, center);
    let a = sqrt(dist2 - center.z * center.z);
    let proj_center = f32x2(a, center.z);
    let t2 = dist2 - radius * radius;
    var t = sqrt(max(t2, 0.f));
    let in_sphere = t2 < 0.f;
    // cos(theta) = t / dist
    // sin(theta) = r / dist
    // T = t * rotate(theta) * proj_center / dist,
    // removing the dist divide in cos, sin
    // ncos(theta) = t
    // nsin(theta) = r
    // rotate(theta) == rotate(ntheta) / dist
    // therefore, T = t * rotate(ntheta) * proj_center / dist2
    // saving us two divides and a sqrt!
    let ncos = t;
    let nsin = radius;
    let wt_z = dot(f32x2(-nsin, ncos), proj_center) / dist2;
    var t_z = t * wt_z;
    if (in_sphere || t_z < this.near) {
        // let off = this.near - center.z;
        // let k = sqrt(radius * radius - off * off);
        // let t = f32x2(a + k, this.near);
        t_z = this.near;
    }
    return t_z;
}
public f32 error_perceptible_at(f32 error) {
    // Don't divide by 2 because the error is already divided by 2 during build.
    return this.screen.y * this.h * this.min_scale * error;
}
public bool should_visit_bvh(f32x4 lod_bounds, f32 parent_error) {
    return this.perceptible_error_distance(lod_bounds) <= this.error_perceptible_at(parent_error);
}
public bool should_render(f32x4 lod_bounds, f32 error) {
    return this.perceptible_error_distance(lod_bounds) > this.error_perceptible_at(error);
}
there
idk why i'm returning t_z
yeah it does that
we can always increase the threshold tho 
idk how correct it is
How do you handle your parameter binding business with a rust/slang setup?
ngl I had no idea that was even possible
why is removing frustum culling reducing the number of meshlets drawn
but increasing the number of bvh nodes traversed (as it should)
task failed successfully
@fiery bolt ok I'm taking a look at your PR today
If i pass ptr::null() for vertex_locks, and add back SimplifyOptions::LockBorder, it should be equivalent to the old code right? For comparison purposes
So uhh it's a bit... aggressive
231 -> 4 meshlets is... a choice
let me try this on the cliff instead of bunnies
Cliffs:
Something's a bit off
Anyways back to compression
probably? but not sure
are these numbers with the edge detection or without?
Wdym? Expand the results, I showed on main/your PR
like, is this after replacing vertex locks with a nullptr or without
hmmmmm
i'm assuming this is the simplification queue length?
so it doesn't mean that the whole mesh was 231 meshlets and it's now 4
what's the threshold at which you reject a simplified group?
my frustum culling has been wrong all along...
lmfao
@languid vector 7.2 trillion tris in 1.46 +- 0.29 ms
holy crap
I am rendering 500m tris at 30ms
:c
occlusion culling
but without occlusion culling
yeah
and without hierarchy
and without sw raster
occlusion cull is a bigger win than hierarchy
i'm gonna tune my sw raster thresholds rn
it's not that much of a difference
around 20% boost
what was the best boost for your virtual geom renderer?
I mean runtime performance
wdym
Yeah, amount of meshlets at each level
It does though
I allow simplifying at least 5% (i.e. 95% the same). Maybe I should change that and then test your changes again
(if you have mesh shaders :P)
no, because some groups would've been rejected and will not be simplified ever again
also true
yeah
i do 60% iirc
Tried 65%
LOD: 0, meshlet count: 15616, meshlet occupancy counts: {128: 15615, 88: 1, }
LOD: 1, meshlet count: 8066, meshlet occupancy counts: {128: 6406, 127: 1114, 64: 352, 63: 156, 126: 29, 62: 7, 43: 1, 125: 1, }
LOD: 2, meshlet count: 4719, meshlet occupancy counts: {128: 2925, 64: 507, 63: 465, 127: 222, 32: 157, 31: 133, 62: 98, 96: 62, 95: 59, 126: 27, 30: 20, 94: 17, 61: 13, 29: 5, 125: 3, 93: 2, 46: 1, 28: 1, 66: 1, 85: 1, }
LOD: 3, meshlet count: 1358, meshlet occupancy counts: {128: 811, 32: 11, 111: 9, 64: 9, 16: 9, 48: 9, 11: 9, 113: 8, 109: 8, 9: 8, 118: 8, 10: 8, 79: 8, 95: 8, 23: 7, 47: 7, 120: 7, 122: 7, 29: 7, 63: 7, 73: 7, 94: 7, 6: 7, 127: 7, 93: 7, 106: 7, 99: 6, 34: 6, 45: 6, 117: 6, 70: 6, 12: 6, 75: 6, 97: 6, 33: 6, 55: 6, 5: 5, 98: 5, 19: 5, 53: 5, 7: 5, 112: 5, 74: 5, 100: 5, 14: 5, 22: 5, 65: 5, 101: 5, 110: 5, 25: 5, 28: 5, 49: 5, 27: 5, 68: 5, 61: 4, 51: 4, 56: 4, 119: 4, 126: 4, 44: 4, 13: 4, 24: 4, 60: 4, 102: 4, 71: 4, 15: 4, 123: 4, 18: 4, 76: 4, 39: 4, 50: 4, 124: 4, 26: 4, 115: 4, 116: 4, 17: 4, 30: 4, 91: 4, 31: 4, 80: 4, 2: 4, 96: 4, 37: 4, 52: 3, 46: 3, 43: 3, 62: 3, 121: 3, 125: 3, 4: 3, 90: 3, 89: 3, 72: 3, 20: 3, 85: 3, 84: 3, 114: 3, 35: 3, 40: 2, 83: 2, 8: 2, 1: 2, 41: 2, 3: 2, 103: 2, 21: 2, 108: 2, 81: 2, 82: 2, 57: 2, 42: 2, 36: 2, 78: 2, 88: 2, 38: 2, 104: 1, 58: 1, 77: 1, 107: 1, 69: 1, 86: 1, }
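For anyone reproducing stats like the ones above, a tiny sketch of how such an occupancy map and an average fill figure can be computed. The function names and the 128-triangle limit are assumptions matching the numbers in this chat, not the actual code that printed them.

```python
from collections import Counter

def meshlet_occupancy(triangles_per_meshlet):
    """Map triangles_per_meshlet -> count_of_meshlets, most common first."""
    return dict(Counter(triangles_per_meshlet).most_common())

def average_fill(occupancy, max_triangles=128):
    """Fraction of the available triangle slots that are actually used."""
    meshlets = sum(occupancy.values())
    triangles = sum(tris * count for tris, count in occupancy.items())
    return triangles / (meshlets * max_triangles)
```

A fill close to 1.0 (as in LOD 0 above) means almost every meshlet is full; the long tail in LOD 3 pulls that figure down.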
is that the cliff?
huh i get
561, 284, 152, 80, 40, 19, 10, 6, 3, 2
meshlets per lod
for the bunny
Yes
Idk. I'm done with trying to improve DAG building atm.
Next todos are compress meshlet data and then BVH-based persistent culling
you can't do persistent queues with wgpu atm
wgsl doesn't support coherency
Hmm, do you need it?
yeah coherency is required for any non-atomic writes to be visible to other workgroups in the same dispatch
same reason it's needed for SPD
Which part needs those?
the whole persistent queue shtick?
Right I know what it does, I'm curious what needs it though for persistent queues
Oh wait
you're writing to a queue lol
yep
yeah I see
Isn't it literally just adding coherent (glsl) / globallycoherent (hlsl) to the buffer declaration though?
and so does nanite on PC
yep
Ok I'll patch naga, easy enough
oh yeah you also need forward progress guarantees
That, or spirv passthrough in bevy
yeah that works
Yeah ik about that part
let me find the spirv thing
metal on M series fails that
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#Decoration coherent decoration
so you have to fall back to dependent dispatches for apple silicon
fails what
Naga is already broken on metal when it comes to atomic<u64> anyways so idc
you don't get forward progress on metal
what's that
it'll just keep spinning
guarantees that workgroups will get switched out eventually even if they're not waiting for a mem access
pretty sure 64-bit atomics exist
no API actually gives that to you
They do, but naga's MSL backend is bugged on them
why the hell would you want that
but it mostly works on nvidia and amd
spinlocks
skill issue use a better api
so nanite doesn't use them on PC
dependent dispatches work well enough™
I might do persistent queues after streaming
maybe
I'm pretty sure cuda does define forward progress guarantees
so that's why it mostly works on NV
ahhh i understand
those CPUs that crash a lot
should I do the whole fixed size page thing or half-ass dynamic allocation 
fixed page all the way
why would you even want dynamic alloc
just do tlsf on gpu
it's just a couple of fls and ffs
you might need a lock tho
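For reference, the "couple of fls and ffs" in TLSF can be sketched like this. It is a CPU-side Rust sketch, not anyone's actual allocator: `SL_LOG2`, `mapping`, and `find_free_at_or_above` are made-up names, and the bin count is an arbitrary assumption. `leading_zeros` stands in for fls and `trailing_zeros` for ffs.

```rust
/// log2 of the number of second-level bins per first-level bin (assumed value).
const SL_LOG2: u32 = 4;

/// Maps a block size to its (first-level, second-level) bin.
/// Only valid for sizes >= 1 << SL_LOG2; smaller sizes would need a separate path.
fn mapping(size: u32) -> (u32, u32) {
    debug_assert!(size >= 1 << SL_LOG2);
    // fls: index of the highest set bit.
    let fl = 31 - size.leading_zeros();
    // The SL_LOG2 bits just below the highest set bit pick the sub-bin.
    let sl = (size >> (fl - SL_LOG2)) ^ (1 << SL_LOG2);
    (fl, sl)
}

/// Finds the lowest non-empty bin at or above `from` in a free-list bitmap.
/// This is the ffs step: mask off the smaller bins, then count trailing zeros.
fn find_free_at_or_above(bitmap: u32, from: u32) -> Option<u32> {
    debug_assert!(from < 32);
    let masked = bitmap & (u32::MAX << from);
    if masked == 0 {
        None
    } else {
        Some(masked.trailing_zeros())
    }
}
```

With 16 sub-bins, a size of 100 lands in first-level bin 6 (64..128) and sub-bin 9. The lock concern above is about updating the bitmaps and free lists, which this sketch does not cover.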
actually yeah no binning is bad 
do standard gpu malloc
because I don't have to do the goofy group part nonsense
@languid vector I have some more questions on the global mesh compression when you get a chance
- I don't get the idea behind the function you sent me a bit ago to calculate the bitrate/step_size for the global/mesh quantization grid. Should this not be a fixed value used for all of your meshes? Also, what's even the point of the global quantization since you store your meshlet centers in a full vec3<f32> anyways no?
- Given that each meshlet stores a bitstream of vertex positions, for meshlet X with triangle index Y, how do you read the vertex position data? With fixed-size positions, i.e. one vec4<f32> per vertex, I can just store one u32 per meshlet pointing to the start of the meshlet's vertices in the large array of vertices, and each triangle index can be a single u8 pointing to an offset off of that starting position. But I'm not sure how to structure things with a bitstream.
- When quantizing, I'm not entirely sure how to handle signedness (i.e. negative or positive). For meshlets, I guess I could map -radius..radius to 0..diameter, and then do ceil(log2(diameter)) to determine the bitrate. For meshes/global grid, I guess I just store the meshlet centers as a full vec3<f32> still? (but quantized to the grid instead of absolute coordinates)
but goofy groupy thingies are cool
how do you check if all parts of a group are loaded 
without Yet More Indirection
literally who cares
you're already bw limited
one more buffer
and maybe another one after that
no but then I have to deal with those buffers
dynamic gpu malloc
me when I commit theft
The Khronos group when creating the Vulkan specification thought about you specifically when creating descriptor sets
oh yeah it's also more work when building
what the fuck is a descriptor set
what's a descriptor set
watch your language
never heard of er
me neither
do you mean pointers maybe
I thought you were a vulkan man
I am yes
I just put a Tex2D<f32> in my push constants
I am a vulkan 1.3 man
are you pulling my leg
and it just works
yes we all use bindless and pointers here
descriptor sets are a bad dream that don't exist anymore
we have vulkaned someone else 
an honest day's work
we need more fresh blood for john khronos
not yet
sorry you cannot resist
i am a GL guy
not for long
pointers
pointers
lol
useless
no
meaningless
nanite makes you a real clown
its all meaningless
true and real
ok time for you to impl nanite
watch out for edge cases
the only thing desirable is hw rt but everytime i mention it L says "No."
I did on Scratch
ok now make it fast
no
28 trillion trongles in 1.5ms
i'm not scared of vulkan its just useless to me
hit that and we'll let you not use vulkan
pointing to what?
memory
if you can't do 25 trillion triangles in 1.5ms you have to use vulkan
neither have you
is that the error projection equation
or edge equations
quadric error metrics maybe perhaps
my error projection makes no sense
ill stick with IDs thanks
real
I just did the screen space projection thing and then multiplied two things
because just using t led to holes
best error function ever
this is like the dark wizards trying to convince me to use dark magic
show him
the slang push const
give him a taste of slang's ultimate power when combined with vk13
(don't tell him about the driver bugs and invalid spirv and out of spec optimizations)
fuck nvidia's driver
all my homies hate nv
me too
i shouldve gone with rx
fucking useless thing crashes with a misaligned addr if i use groupshared mem in my software rasterizer
I have zero idea how they say slang is "production ready"
i've looked through the entire spirv slang shits out
"release" "driver"
it's perfectly fine
so i wrote it in hlsl
works now
now i have three definitions of my gpu scene structs 
bruh
this was in a minimized diff that just wrote zeroes to the arr
still crashed with slang so i assume that's the issue
and using groupshared arrays in my mesh shader also works
the whole thing is ub
but atomics don't!
slang default InterlockedAdd uses device scope for everything so i wrote my own spirv asm thingy for VMM (and all barriers etc)
it still died
do you think novideo will hire me if i tell them i'll fix their shit shader compiler

oh yeah VMM make available | make visible doesn't work either for some reason so i had to split my dispatches
what is that
Haven't implemented that yet.
I guess to answer my own question for 2., each meshlet can store the first bit of its vertex positions within the bitstream, as well as its quantization factor, which can then be used to calculate how many bits each vertex position uses for the meshlet. That bit count is fixed per meshlet, giving you random access within the meshlet.
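A rough CPU-side sketch of that random access idea, assuming an LSB-first bitstream and the same bit count for all three channels of a meshlet (both are assumptions for illustration, as are the helper names `write_bits`, `read_bits`, and `read_vertex`):

```rust
/// Appends the low `bits` bits of `value` to an LSB-first bitstream.
fn write_bits(stream: &mut Vec<u8>, bit_len: &mut usize, value: u32, bits: u32) {
    for i in 0..bits {
        if *bit_len / 8 == stream.len() {
            stream.push(0);
        }
        let bit = ((value >> i) & 1) as u8;
        stream[*bit_len / 8] |= bit << (*bit_len % 8);
        *bit_len += 1;
    }
}

/// Reads `bits` bits (at most 32) starting at absolute bit offset `start`.
fn read_bits(stream: &[u8], start: usize, bits: u32) -> u32 {
    let mut value = 0u32;
    for i in 0..bits as usize {
        let pos = start + i;
        let bit = (stream[pos / 8] >> (pos % 8)) & 1;
        value |= (bit as u32) << i;
    }
    value
}

/// Random access to vertex `index` of a meshlet: because the bit width is
/// fixed per meshlet, the vertex's bit offset is a simple multiply-add.
fn read_vertex(stream: &[u8], start_bit: u32, bits_per_channel: u32, index: u8) -> [u32; 3] {
    let b = bits_per_channel as usize;
    let base = start_bit as usize + index as usize * 3 * b;
    [
        read_bits(stream, base, bits_per_channel),
        read_bits(stream, base + b, bits_per_channel),
        read_bits(stream, base + 2 * b, bits_per_channel),
    ]
}
```

The triangle indices can stay as `u8` offsets into the meshlet's vertices, since `read_vertex` only needs the meshlet's start bit and bit width. A GPU version would likely read whole `u32` words instead of looping bit by bit.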
@wicked notch in cuda and the likes you use a special dispatch for forward progress
and the api requires that your dispatch size is under some limit
for forward progress guarantees to hold
and the limit varies device to device
and yes that would be useful in vk no less, I'm just saying it's a bit tricky to use
it's not super straightforward
ye it's flimsy
Yeah yesterday
pub struct MeshletMesh {
    /// Bitstream-packed vertex positions.
    pub vertex_positions: Arc<[u8]>,
    /// Octahedral compressed normals and uncompressed texture coordinates for vertices.
    pub vertex_attributes: Arc<[u8]>,
    /// Triangle indices for meshlets.
    pub indices: Arc<[u8]>,
    /// The list of meshlets making up this mesh.
    pub meshlets: Arc<[Meshlet]>,
    /// Spherical bounding volumes.
    pub bounding_spheres: Arc<[MeshletBoundingSpheres]>,
}
/// A single meshlet within a [`MeshletMesh`].
#[repr(C)]
pub struct Meshlet {
    /// The bit offset within the parent mesh's [`MeshletMesh::vertex_positions`] buffer where the vertex positions for this meshlet begin.
    pub start_vertex_position_bit: u32,
    /// The offset within the parent mesh's [`MeshletMesh::vertex_attributes`] buffer where the vertex attributes for this meshlet begin.
    pub start_vertex_attribute_id: u32,
    /// The offset within the parent mesh's [`MeshletMesh::indices`] buffer where the indices for this meshlet begin.
    pub start_index_id: u32,
    /// The amount of vertices in this meshlet.
    pub vertex_count: u8,
    /// The amount of triangles in this meshlet.
    pub triangle_count: u8,
    /// Number of bits used to quantize vertex positions within this meshlet.
    pub bits_per_vertex_position: u8,
    /// Unused. (TODO: Get rid of this in the disk representation?)
    pub padding: u8,
}
Ok, got this so far.
why is everything Arc 
I thought it was atomic ref counted
Idk what the ref countedness means in the context of a single uint
it's just shared_ptr
Reasons internal to how bevy's renderer works to avoid copying the data across threads
Yeah, but I also need some sort of smart pointer inside anyways to store unbounded arrays [u8]
It was either Arc (shared_ptr) or Box (unique_ptr)
or you can do a custom DST 
?
you can make your own DST with a header and an unsized tail of bytes 
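A minimal sketch of what such a custom DST could look like on stable Rust. `Blob` and `make_blob` are hypothetical names, and this version only covers tails whose length is known at compile time; a runtime-sized tail needs manual allocation with unsafe code or a helper crate.

```rust
use std::sync::Arc;

/// Hypothetical header plus tail. `T` is `[u8; N]` when the value is built
/// and `[u8]` once the length has been erased by unsized coercion.
struct Blob<T: ?Sized> {
    vertex_count: u32,
    bytes: T,
}

/// Builds a sized blob and coerces it to the unsized form in one allocation.
fn make_blob<const N: usize>(vertex_count: u32, bytes: [u8; N]) -> Arc<Blob<[u8]>> {
    let sized: Arc<Blob<[u8; N]>> = Arc::new(Blob { vertex_count, bytes });
    // Unsized coercion: Arc<Blob<[u8; N]>> becomes Arc<Blob<[u8]>>.
    sized
}
```

The header and the bytes then live in the same allocation behind a single fat pointer, instead of a struct that holds a separate `Arc<[u8]>` per field.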
Ehhh maybe some other time if it becomes an issue lol
Compression continues to frustrate me greatly
I still have not seen good explanations for how half of it works
Mostly the purpose of the global grid
The global grid is there to avoid cracks in the model between different meshes. If every mesh would have its own grid then there would be cracks because each mesh aligned differently
What are people's guidelines for minimum mesh vertex counts for instancing? I understand you want at least 64 to fill a threadgroup, but various quotes online seem to indicate a few hundred is better. (In this situation, it's an instanced quadtree where I can control vert count)
yeah this is required for kitbashing or any sort of modular design
I don't understand why
Nor do I think that's right
I think nanite was saying you need the same grid to avoid cracks between objects you would get from different grids
But unclear why they have one in the first place
but isnt that what lukasino said?
Right, but it's missing some context
Yes if you do different grid sizes per mesh there would be issues
But what's the purpose of the grid in the first place?
Like you could also solve the same problem by... Not quantizing anything to a grid
how would you quantize it
without quantizing you need the full 12 bytes for position
and it won't byte-compress very well I think
But you can compress with the per-cluster encoding, which afaik(?) is lossless(?)
So it'll compress to a more compact bitstream regardless
I think the global quantizing step beforehand is just to reduce the precision, in order to need less bits for the second step? Idk for sure
And then it makes a bunch of other things more complicated
There's two steps. The quantization with a fixed step size for all meshes, and then encoding vertices per-cluster
I think the purpose of the first is to reduce excess precision (so less bits are needed), and the second is just a more compact encoding of the same data
if your compression doesnt yield the exact same vertex positions for border vertices you get cracks
i guess the world grid helps keep the quantization consistent
i dont think the quantization is lossless
looks like caldera hotel
I'm so cooked
why do I recognize that this is bistro
oh hehe
I did too, don't feel bad
Why though? How does that work? I haven't figured out why it helps
because within a cluster you may just happen to have vertices at 0, 2, 4, 6, 8.0000000901, 10
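To make that concrete, here is a small sketch of snapping to a fixed grid. The step of 1/16 and the helper names are assumptions for illustration, and 8.000001 stands in for the noisy value above, since an `f32` cannot hold 8.0000000901 as anything other than 8.0.

```rust
/// Snaps a coordinate to a grid with spacing `step`, returning the grid index.
fn quantize(x: f32, step: f32) -> i32 {
    (x / step).round() as i32
}

/// Converts a grid index back to a coordinate.
fn dequantize(q: i32, step: f32) -> f32 {
    q as f32 * step
}

/// Bits needed to store every value as an unsigned offset from the minimum.
fn bits_for_range(values: &[i32]) -> u32 {
    let min = *values.iter().min().expect("at least one value");
    let max = *values.iter().max().expect("at least one value");
    let range = (max as i64 - min as i64) as u64;
    64 - range.leading_zeros()
}
```

With a step of 1/16, the values 0, 2, 4, 6, 8.000001, 10 become the integers 0, 32, 64, 96, 128, 160, which fit in 8 bits each. The noise is gone, so the quantization is lossy, and the per-cluster encoding only has to store small integers. Two meshes that share a border vertex get the same integer as long as they use the same step. With different steps (say 1/16 and 0.1), a vertex at 8.04 decodes to 8.0625 in one mesh and 8.0 in the other, which is the crack.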
@wicked notch i might have posted it somewhere before, im not sure, but it mentions mesh decimation/remeshing etc not sure if its useful or was useful as an alternative or better working meshlet thingy https://github.com/pmp-library/pmp-library but ill link it anyway
very nice, I'll check it out in a bit
Interesting issue I've found, shadows look too bad atm in bevy for meshlets
It's choosing too low of a lod I'm guessing, and the error is very visible when viewed by the main camera
I probably need to add a lod bias to shadow views
Ironically nanite does the same, but they bias towards a less accurate lod, as VSM is so high res anyways it's never a problem
Mhmm also true
VSM is too powerful
Rasterization is a hack and does not work for non-primary views
And I won't pretend otherwise
Banned
banned
Promoted to Admin
not you too
writing a raster renderer just feels like strapping hacks upon hacks together
don't waste your 0.25 rays per pixel on shadows, use them for GI and real specular smh
Ok after much research, I finally understand nanite's vertex quantization now
/// A single meshlet within a [`MeshletMesh`].
#[derive(Copy, Clone, Pod, Zeroable)]
#[repr(C)]
pub struct Meshlet {
    /// The bit offset within the parent mesh's [`MeshletMesh::vertex_positions`] buffer where the vertex positions for this meshlet begin.
    pub start_vertex_position_bit: u32,
    /// The offset within the parent mesh's [`MeshletMesh::vertex_normals`] and [`MeshletMesh::vertex_uvs`] buffers
    /// where non-position vertex attributes for this meshlet begin.
    pub start_vertex_attribute_id: u32,
    /// The offset within the parent mesh's [`MeshletMesh::indices`] buffer where the indices for this meshlet begin.
    pub start_index_id: u32,
    /// The amount of vertices in this meshlet.
    pub vertex_count: u8,
    /// The amount of triangles in this meshlet.
    pub triangle_count: u8,
    /// Number of bits used to quantize vertex positions within this meshlet.
    pub quantization_bits: u8,
    /// Number of bits used to store the X channel of vertex positions within this meshlet.
    pub bits_per_vertex_position_channel_x: u8,
    /// Number of bits used to store the Y channel of vertex positions within this meshlet.
    pub bits_per_vertex_position_channel_y: u8,
    /// Number of bits used to store the Z channel of vertex positions within this meshlet.
    pub bits_per_vertex_position_channel_z: u8,
    /// Unused. (TODO: Get rid of this in the disk representation?)
    pub padding: u16,
    /// Minimum quantized X channel value of vertex positions within this meshlet.
    pub min_vertex_position_channel_x: f32,
    /// Minimum quantized Y channel value of vertex positions within this meshlet.
    pub min_vertex_position_channel_y: f32,
    /// Minimum quantized Z channel value of vertex positions within this meshlet.
    pub min_vertex_position_channel_z: f32,
}
The perfect 256 bits of metadata
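A sketch of how the per-channel fields might be filled in and used, under stated assumptions: positions are already quantized to integers on the shared grid, the grid step is `2^-quantization_bits` (a guess at what that field means), and `ChannelInfo`, `channel_info`, and `decode` are hypothetical helpers rather than the actual bevy code.

```rust
/// Per-channel encoding info for one meshlet (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct ChannelInfo {
    /// Minimum quantized value per channel.
    min: [i32; 3],
    /// Bits needed per channel to store offsets from `min`.
    bits: [u32; 3],
}

/// Computes the per-channel minimum and bit width for a meshlet's vertices.
fn channel_info(quantized: &[[i32; 3]]) -> ChannelInfo {
    assert!(!quantized.is_empty());
    let mut min = [i32::MAX; 3];
    let mut max = [i32::MIN; 3];
    for v in quantized {
        for c in 0..3 {
            min[c] = min[c].min(v[c]);
            max[c] = max[c].max(v[c]);
        }
    }
    let bits = [0, 1, 2].map(|c| {
        let range = (max[c] as i64 - min[c] as i64) as u64;
        64 - range.leading_zeros()
    });
    ChannelInfo { min, bits }
}

/// Turns stored per-channel offsets back into a float position.
/// Assumes a grid step of 2^-quantization_bits.
fn decode(offset: [u32; 3], info: &ChannelInfo, quantization_bits: u8) -> [f32; 3] {
    let step = 1.0 / (1u32 << quantization_bits) as f32;
    [0, 1, 2].map(|c| (info.min[c] + offset[c] as i32) as f32 * step)
}
```

Storing offsets from the per-channel minimum sidesteps the sign question, since every stored value is non-negative. A channel where all vertices share one value needs 0 bits.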
Do you have duplicated vertex positions for each meshlet? Meaning each vertex had its own "vertices"
good
I assume you meant each meshlet has its own set of vertices, and yes I do. I was skeptical at first, but it allows streaming and much better compression, and vertex data memory usage is a large bottleneck.
Yes, that's what I meant. I am just checking
Update: It's been taking a while due to learning and then being sick, but my fever finally broke and I finished all the CPU-side changes for compressed per-meshlet vertices.
Just need to figure out how to do the GPU bitstream reader
I am also using a fixed quantization factor per mesh rn, nanite has an "auto" mode that chooses the best one, I'll have to figure out how they did that.
don't all meshes need to have the same quantization factor
because it would lead to cracks otherwise
Maybe, idk. The original nanite presentation said it's user-selectable and has to be the same for different meshes, but unreal has an "auto" option.
Choose the precision this mesh should use when generating the Nanite mesh. Auto determines the appropriate precision based on the size of the mesh. The precision can be overridden to improve precision or optimize disk footprint.