#Iris - A Journey through OpenGL and beyond to learn Graphics


primal shadow
#

from unreal 5.0's docs

#

We achieve this by performing quantization in object space using a user-selectable
power of 2 step size centered around the object origin.
It is crucial that the step size is not normalized to the bounds of the object or in other
ways tied to its dimensions.
From the nanite siggraph presentation

#

So yeah, feels a bit contradictory

#

I assume they just choose a step size based on how big the object is, under the assumption that meshes that will be used together should have similar sizes, and therefore the same size-based heuristic should lead to the same quantization factor. Then it'll work out fine when both meshes get used with transforms that are multiples of the step size
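A minimal sketch of the quantization idea from the quote, assuming a step size of `2^step_exponent` shared by both meshes. The helper names `quantize` and `dequantize` are hypothetical and are not taken from Nanite or Bevy.

```rust
/// Snap one object-space coordinate to a grid with step 2^step_exponent,
/// centered on the object origin. Hypothetical helper for illustration.
fn quantize(value: f32, step_exponent: i32) -> i32 {
    let step = 2.0_f32.powi(step_exponent);
    (value / step).round() as i32
}

/// Map a grid index back to an object-space coordinate.
fn dequantize(index: i32, step_exponent: i32) -> f32 {
    index as f32 * 2.0_f32.powi(step_exponent)
}
```

Because the step is not tied to each object's bounds, a vertex shared by two meshes should land on the same grid point as long as the meshes are offset by a multiple of the step.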

primal shadow
#

memory allocation of 4548506711262943144 bytes failed uh oh

wicked notch
#

imagine failing to allocate 4039 petabytes smh

#

are you stuck in 2024

primal shadow
#
memory allocation of 4548506711262943144 bytes failed
[Inferior 1 (process 209) exited with code 011]
(gdb) bt
No stack.
(gdb)

wtf

wispy spear
#

how many exa bytes is that

primal shadow
#

idk what went wrong :((

primal shadow
#

oh whoops, wasn't my code's fault

#

forgot to copy paste some asset writing code into my testing setup

buoyant summit
#

it exited

#

normally

#

if you were to crash it, like with abort(), then it'd work

primal shadow
#

Nooo, I finished my shaders for decoding, and all it's rendering is points D:

primal shadow
#

hmm, so every vertex has the same position, huh...

#

oh, LOL

#

totally forgot to use the vertex_id parameter of my get_meshlet_vertex_position() function 😅

#

that would explain it

#

oh god

fiery bolt
#

perfectly shippable

wicked notch
#

looks bunny enough to me

#

ship it

frank sail
#

it's a buggy

fiery bolt
#

bugs bunny!

primal shadow
#

I tested it on a cube, it's rendering as a 2d plane

#

heck

primal shadow
#

Bleh I can't figure out how to fix it, as renderdoc seems to be showing my fake values in the debugger :/

wispy spear
# primal shadow oh god

artifishial paint simulator, i like it. that blue bunny on the right ackchually looks quite cool

#

i will steal the image and use it as the new server banner : )

primal shadow
#

Ah I have discovered my issue, fml

#

the bitstream reader I did on the GPU does not account for when the bits of a vertex are split across two different buffer elements :/
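A minimal CPU-side sketch of a bit reader that handles the straddling case described above. It assumes the stream is stored as 32-bit words, least-significant bit first; that layout and the name `read_bits` are assumptions, not the actual shader code.

```rust
/// Read `bit_count` bits (1..=32) starting at absolute bit offset `bit_offset`.
fn read_bits(words: &[u32], bit_offset: usize, bit_count: u32) -> u32 {
    assert!(bit_count >= 1 && bit_count <= 32);
    let word_index = bit_offset / 32;
    let bit_in_word = (bit_offset % 32) as u32;

    // Bits available from the first word.
    let mut value = words[word_index] >> bit_in_word;

    // If the value straddles a word boundary, take the rest from the next word.
    let bits_from_first = 32 - bit_in_word;
    if bit_count > bits_from_first {
        value |= words[word_index + 1] << bits_from_first;
    }

    // Mask off anything above the requested width.
    if bit_count < 32 {
        value &= (1u32 << bit_count) - 1;
    }
    value
}
```

The same structure should carry over to a shader, where the second load is the part that is easy to forget.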

primal shadow
#

Spent 3h trying to fix this, broke my brain

#

didn't figure it out

primal shadow
#

People in #webgpu helped me out, think I got it working

#

Shading looked a bit off though, I think I'm compressing normals too much

primal shadow
#

I'm not compressing UVs, but so far per-meshlet buffer is not a size savings, at least for the bunny mesh

#

I guess the triangle compression I can do should improve it more

#

And streaming will give me memory savings

wide shadow
#

compressing UVs should be safe from f32vec2 -> u32 (packed 2 halfs)

primal shadow
#

I've heard very bad things about 16bit UVs

wide shadow
#

I mean it depends on the asset bleakekw

primal shadow
#

Ok yeah, using octahedral encode, and snorm2x16 is too inaccurate

#

Wait am I supposed to be using snorm, or unorm 🤔

#

What's the output of octahedral encode?

#

[0,1] or [-1,1]?
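For reference, a sketch of the commonly used octahedral formulation, which produces values in [-1, 1] and so pairs naturally with snorm storage. Some variants remap with `v * 0.5 + 0.5` to use unorm instead. The function names are illustrative only.

```rust
fn sign(v: f32) -> f32 {
    if v >= 0.0 { 1.0 } else { -1.0 }
}

/// Octahedral-encode a unit vector into two values in [-1, 1].
fn octahedral_encode(n: [f32; 3]) -> [f32; 2] {
    let l1 = n[0].abs() + n[1].abs() + n[2].abs();
    let (x, y) = (n[0] / l1, n[1] / l1);
    if n[2] >= 0.0 {
        [x, y]
    } else {
        // Fold the lower hemisphere over the diagonals of the square.
        [(1.0 - y.abs()) * sign(x), (1.0 - x.abs()) * sign(y)]
    }
}

/// Pack two [-1, 1] values as 16-bit snorm, first component in the low bits.
fn pack_snorm2x16(v: [f32; 2]) -> u32 {
    let q = |f: f32| ((f.clamp(-1.0, 1.0) * 32767.0).round() as i16) as u16 as u32;
    q(v[0]) | (q(v[1]) << 16)
}
```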

wispy spear
#

@wicked notch Mega Lights 2.0 when?

delicate rain
#

Are they rt btw or is it not known?

#

Must be rt no?

wispy spear
#

i have no idea, only saw the summary at gamesfromscratch

primal shadow
delicate rain
#

Interesting

primal shadow
delicate rain
#

Stochastic light sampling doesn't really say much

#

But yeah makes sense, thank you!

primal shadow
#

Pray I do not quantize you further

fiery bolt
#

automatic minecraft shader

#

ship it

fiery bolt
wicked notch
#

fallback repr

fiery bolt
#

that's not 'properly'

#

that's cringe

#

incredibly cringe

#

so cringe that i'm probably gonna do fallback repr myself

primal shadow
wicked notch
#

it is what unreal does

#

you might not like it

#

but unreal doesn't care that you don't like it

fiery bolt
#

but what if I ping the entire epic games developer github org

wicked notch
#

hmm bold strategy

#

almost as bold as the king's gambit

fiery bolt
#

@primal shadow just fixed another silly mistake in the edge detection, maybe try it now 🙃

ebon ruin
#

Hello Nanite folks

#

my meshlet generator sometimes creates split meshlets where a meshlet would be in two pieces (that can be quite far from each other). Will this be problematic when implementing the LOD tree?

wicked notch
#

yes

ebon ruin
#

Here's an example:

#

oops

ebon ruin
primal shadow
primal shadow
primal shadow
#

Question for people: how are you generating the LOD bounds for the group?

#

not for individual meshlets

fiery bolt
#

for base groups, merge the bounds of the meshlets (or just calculate it directly)

#

then merge the group bounds of all meshlets for higher LODs

primal shadow
#

Oops forgot to update it here, I figured it out

#

Yep that's what I figured out was correct

#

Except apparently getting a minimal bounding sphere of bounding spheres is a 183 page thesis xD

#

There's an open source implementation in the cgal library, but it's gpl v3 :/

#

So guess I'll go with an approximate method
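One possible approximate method, sketched under the assumption that a non-minimal enclosing sphere is acceptable: merge spheres pairwise. Each pairwise merge is exact, but the folded result depends on order and is generally larger than the true minimum. `Sphere`, `merge`, and `merge_all` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

/// Smallest sphere enclosing two spheres.
fn merge(a: Sphere, b: Sphere) -> Sphere {
    let d = [
        b.center[0] - a.center[0],
        b.center[1] - a.center[1],
        b.center[2] - a.center[2],
    ];
    let dist = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt();
    // One sphere already contains the other (also covers dist == 0).
    if dist + b.radius <= a.radius {
        return a;
    }
    if dist + a.radius <= b.radius {
        return b;
    }
    let radius = (dist + a.radius + b.radius) * 0.5;
    let t = (radius - a.radius) / dist;
    Sphere {
        center: [
            a.center[0] + d[0] * t,
            a.center[1] + d[1] * t,
            a.center[2] + d[2] * t,
        ],
        radius,
    }
}

/// Enclosing (not necessarily minimal) sphere of many spheres.
fn merge_all(spheres: &[Sphere]) -> Option<Sphere> {
    spheres.iter().copied().reduce(merge)
}
```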

primal shadow
#

Did some asset size / quality / perf comparisons between compressed per-meshlet vertex data, and a single set of uncompressed vertex data shared between all meshlets. Sadly asset size is nearly identical. On the upside, quality and perf are also identical, I can implement streaming now, and there's still room to further compress vertex data so more wins in the future hopefully. https://github.com/bevyengine/bevy/pull/15643#issuecomment-2395198350

#

UVs are completely uncompressed (64 bits), normals can probably be quantized and variable-length encoded similar to positions rather than the current 32 bits per vertex, and triangle data can be compressed with fancy triangle strip encodings (currently 24 bits per triangle, 8 bits per index * 3)

faint crane
fiery bolt
primal shadow
#

I'm realizing I have no idea how nanite's error projection is supposed to work

#

Each cluster needs: bounding sphere of its group, and of its parent group

#

and then, parent group bounding spheres must strictly encompass all of their child bounding spheres

#

but then how do you mix error into this setup? Where does the error come in during the building and runtime steps?

#

I suppose the test is group_can_be_rendered = projected_sphere_radius(group.center, group.radius) < group.error_radius

#

I.e. the group has error_radius deformity from its children, so if the size of the group on screen is less than that, it's basically equivalent to its children, so it's ok to render

#

no, that doesn't seem quite right either

fiery bolt
#

yeah me neither

#

it's a pain

#

what i was originally doing was placing a sphere on the closest point of the lod bounds and checking its projected screenspace radius

#

(and clamping to the camera)

#

but that leads to holes for some reason

#

now i do this

#

but i forgot what it's actually doing bleaker_kekw

#

and it sometimes leads to double-rendering i think

fiery bolt
#

(please tell me if you come up with something better)

primal shadow
#

For each cluster, store culling bounding sphere, lod bounding sphere, and error

#

Leaf meshlets (i.e. initial set of starting meshlets): Generate the culling bounding sphere, lod bounding sphere is a copy of the culling bounding sphere, error = 0

#

And then you group meshlets, simplify the group, and split into new meshlets

#

And for each new meshlet: compute culling bounding sphere, lod bounding sphere = new sphere encompassing lod bounding sphere of all children in the group, and error = max(simplification_error, child1_error, child2_error, ...)

#

Ok so that's building, now you have for each cluster: culling sphere, lod sphere, and error
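A small sketch of the build-time bookkeeping described above, limited to the error term and the leaf case. `LodData`, `leaf_lod_data`, and `parent_error` are hypothetical names, and the LOD sphere is stored as a plain (center, radius) pair for brevity.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct LodData {
    lod_sphere: ([f32; 3], f32), // (center, radius)
    error: f32,
}

/// Leaf meshlets copy their culling sphere and start with zero error.
fn leaf_lod_data(culling_center: [f32; 3], culling_radius: f32) -> LodData {
    LodData { lod_sphere: (culling_center, culling_radius), error: 0.0 }
}

/// Error for meshlets produced by simplifying a group: never smaller than any
/// child's error, which keeps error monotonic going up the hierarchy.
fn parent_error(simplification_error: f32, child_errors: &[f32]) -> f32 {
    child_errors.iter().copied().fold(simplification_error, f32::max)
}
```

Taking the max means a coarse level can never report less error than the finer levels beneath it, even when the simplifier's own estimate for that step is small.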

#

now at runtime you gotta do this

#

tight bounding sphere = meshlet cluster bounding sphere

#

...and this runtime part I'm still reading the paper to figure out

#

but anyways you do this, and also for the parent sphere somehow?

#

and then you draw if self == good && parent == bad
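The draw rule can be sketched as below, assuming both errors have already been projected to pixels by some means (how to do that projection is the open question in this thread). With monotonic errors, exactly one level along any parent chain should pass. The names are illustrative.

```rust
/// Draw a cluster when its own error is acceptable but its parent's is not.
fn should_draw(self_error_px: f32, parent_error_px: f32, threshold_px: f32) -> bool {
    self_error_px <= threshold_px && parent_error_px > threshold_px
}

/// Given projected errors along one chain from leaf to root (with the root's
/// parent error as the last entry), return the levels that would be drawn.
fn selected_levels(errors_px: &[f32], threshold_px: f32) -> Vec<usize> {
    errors_px
        .windows(2)
        .enumerate()
        .filter(|(_, w)| should_draw(w[0], w[1], threshold_px))
        .map(|(i, _)| i)
        .collect()
}
```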

#

finding the minimum enclosing ball of points (for leafs) and minimum enclosing ball of balls (for all other nodes) [35]
Oh hey, they reference fischer's thesis, ok cool so I was on the right track with that

primal shadow
#

it seems like you're supposed to take the cluster's bounding sphere, find the closest point on the surface to the viewport, and then project a new sphere where the center is that point, and the radius is your error

#

so that tells you whether or not the current cluster has visible error, but what do you do about the parent???

#

And when you're building a BVH like nanite does, you can't involve the cluster bounding sphere at all, it has to be based solely on group/LOD data

#

So nanite must be doing something different here

#

really I think my original idea I've been using for the past year is on track

#

the cluster bounding sphere doesn't matter

#

what matters is for the cluster's group, and the cluster's parent group (group with cluster in it before simplifying),

#

given the LOD bounding sphere (located somewhere), with radius = group error, is the projected size of that group small enough such that the error is invisible?

#

The problem is, where do you choose to locate that sphere?

#

that's really the key question

#

because if you're saying radius = error, and you force error to be monotonic, then the bounding sphere projections will always be monotonic if they're located in the same spot

#

the issue is if you start moving where the bounding spheres are, then you run into issues

#

so where the heck do you choose to locate it??

primal shadow
#

I think the way Nanite does this is not necessarily straightforward sphere projection

#

You have the group bounding sphere (encompassing all child group bounding spheres), and the group error of the cluster

#

And then you somehow project that error to the screen using the group sphere bounds

#

but it's not projecting the sphere itself? Something like that

primal shadow
#

tight bounding sphere = group bounds

#

and then you find the closest point on the group bounds to the viewport, make a new sphere centered there with radius = error, and then calculate projected size of that sphere?

fiery bolt
# primal shadow Maybe it _is_ this

that is using the saturated sphere (lod bounds) and placing the error sphere on the closest point on that, and then projecting it to the screen

#

unfortunately it doesn't work if you're inside the lod bound sphere

#

or it doesn't work with a bvh, idk

#

didn't work for me when i tried it

primal shadow
fiery bolt
#

so there's holes in the mesh when the camera is inside one lod bound but not the other

#

or something like that

#

instead i calculate the distance that the error would be less than a pixel and check if the closest point on the lod sphere (or something like that) is closer or farther than that
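A sketch of that distance-based test, assuming a symmetric perspective projection and measuring pixel size at the screen center (so it ignores the off-center distortion discussed later). All names and parameters here are hypothetical.

```rust
/// Distance at which `error` (world units) projects to one pixel.
fn one_pixel_distance(error: f32, fov_y_radians: f32, screen_height_px: f32) -> f32 {
    error * screen_height_px / (2.0 * (fov_y_radians * 0.5).tan())
}

/// Distance from the camera to the closest point of the LOD sphere, clamped to
/// the near plane so a camera inside the sphere never yields a value <= 0.
fn closest_distance(center: [f32; 3], radius: f32, camera: [f32; 3], z_near: f32) -> f32 {
    let d = [center[0] - camera[0], center[1] - camera[1], center[2] - camera[2]];
    let len = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt();
    (len - radius).max(z_near)
}

/// True when the closest point of the LOD sphere is far enough away that the
/// error covers less than a pixel.
fn error_is_subpixel(
    center: [f32; 3],
    radius: f32,
    error: f32,
    camera: [f32; 3],
    z_near: f32,
    fov_y_radians: f32,
    screen_height_px: f32,
) -> bool {
    closest_distance(center, radius, camera, z_near)
        >= one_pixel_distance(error, fov_y_radians, screen_height_px)
}
```

With the clamp, a camera inside the LOD sphere is treated as being at the near plane, which makes almost any nonzero error count as visible.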

primal shadow
#

Forcing you to pick a finer LOD

#

Think I'm going to try that with projecting an error-radius sphere on the closest point on the LOD sphere

primal shadow
fiery bolt
#

but it's buggy so I might need to revisit that lol

primal shadow
primal shadow
#

LZ4 was somehow doing a shit ton of work before, considering the before/after asset size with LZ4 compression applied is basically the same

#

But memory usage is nearly halved after

primal shadow
primal shadow
#

Confusing

primal shadow
#

Well this clearly didn't work

#

Much better, but I think it's vastly over-estimating error 😅

wispy spear
fiery bolt
# primal shadow

I think this is what I was describing that isn't always monotonic?

#

but which paper is that

primal shadow
#
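fn lod_error_is_imperceptible(sphere: MeshletBoundingSphere, error: f32, world_from_local: mat4x4<f32>, world_scale: f32) -> bool {
    // Move the LOD sphere center from object space to world space, then to view space.
    let cp_world = world_from_local * vec4(sphere.center, 1.0);
    let r_view = world_scale * sphere.radius;
    let cp_view = (view.view_from_world * vec4(cp_world.xyz, 1.0)).xyz;

    // TODO: Handle view clipping / being inside sphere bounds
    // Project the sphere and take the larger extent of its screen-space AABB.
    let aabb = project_view_space_sphere_to_screen_space_aabb(cp_view, r_view);
    let screen_size = max(aabb.z - aabb.x, aabb.w - aabb.y);
    // Object-space length per unit of screen size (radius over projected extent).
    let meters_per_pixel = sphere.radius / screen_size;

    // The error is imperceptible if it is smaller than that length.
    return error < meters_per_pixel;
}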

Not documented and poorly named atm, but this

#

Take LOD sphere, project to screen space to get the pixel size

#

Then you do sphere.radius(?) / pixel_size

#

I.e. if your sphere has radius 10, and your pixel_size is 4

#

you get 10/4 = 2.5

#

E.g. 2.5 meters = 1 pixel on screen

#

And error is already an object-space distance in meters

#

So now if e.g. error = 3.2

#

Well 2.5 meters = 1 pixel

#

So 3.2 meters on screen is greater than 1 pixel

#

I.e. visible error

#

So it's sufficient to check that error < meters_per_pixel

#

I.e. error needs to be less than 2.5 meters so that it's less than 1 pixel on screen

#

Since the relation between meters and pixel on screen is linear
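The worked numbers above, restated as a tiny sketch. It assumes the projected size is already in pixels and that the meters-to-pixels relation is linear, as the messages state; the function names mirror the shader but this is not the shader itself.

```rust
/// Object-space length that maps to one pixel, given the sphere radius and
/// its projected size in pixels.
fn meters_per_pixel(sphere_radius: f32, projected_size_px: f32) -> f32 {
    sphere_radius / projected_size_px
}

/// The error is treated as invisible when it is smaller than one pixel's worth
/// of object-space length.
fn lod_error_is_imperceptible(error: f32, sphere_radius: f32, projected_size_px: f32) -> bool {
    error < meters_per_pixel(sphere_radius, projected_size_px)
}
```

With radius 10 and a projected size of 4 pixels, an error of 3.2 is visible while an error of 2.0 is not.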

#

I do need to handle clipping when inside the sphere bounds though. The paper covers it.

#

I'm not quite convinced on some of this though. And it feels really weird to project the LOD sphere and then compare the size to the simplification error, rather than projecting the simplification error directly.

fiery bolt
#

the screenshot you sent above mentions something about comparing error directly with distance * some threshold thonk

#

this code looks different

primal shadow
#

Yeah I didn't follow it

#

Because I have no idea how to handle the case where you're inside the sphere for that

#

The one from the batched multi triangulation paper

#

I also could not find code or the algorithm description for it at all, I'm giving up on that approach

fiery bolt
#

i can take a look and try to figure out what it is

primal shadow
fiery bolt
#

thanks!

fiery bolt
primal shadow
#

UE5 Nanite computation is similar in principle, but mechanically different - it takes into account cases where the sphere is clipped by znear, and it takes camera orientation into account, so the computation is not camera rotation invariant. Conceptually I think it's the same as your reference to fig 3.15, even though I find that specific figure odd as no lines or points there connect to sphere radius 🙂 they compute the projected sphere radius in pixels, invert it (that way they get the length that would project to one pixel, using the same coordinates that the sphere is in), and compare that to the error (which is also in linear units in the same coordinates that the sphere is in).

fiery bolt
#

hmm

#

from what i understand, it's the same thing you're doing

#

but the inversion is probably more complex than just a div

#

it could also be calculating the distance at which the error becomes less than a pixel, and just comparing the closest point on the sphere with that

#

but the distance for 1px error depends on where the sphere is thonk

#

so there could be some normalization step to convert from an off-center sphere to a sphere in the center

primal shadow
#

I have no clue

primal shadow
fiery bolt
#

he does say that perspective distortion exists

#

which makes it nonlinear

primal shadow
#

Ok lol so I forgot to multiply 0..1 by the view size 😅 , looks better now

#

Zeux also left me some more info, so I'm going to take a look at that too

primal shadow
#

Ok I'm just stealing zeux's code

#

I give up trying to understand this

#

Not sure I handled ortho correctly but yeah

faint crane
#

Is orthographic even useful with a Nanite implementation?

#

I guess reviewers will complain anyway, even if w=1.

primal shadow
faint crane
#

If only I could go to the internet archive.

primal shadow
#

@fiery bolt what are you using for your error projection? Seems like you're using a method I don't understand based on the projected bounding sphere

fiery bolt
#

i have no idea tbh

#

i did geometry on paper for it

#

idk where it went

#

it's not correct tho

primal shadow
#

Hmm ok. Back to builder improvements.

fiery bolt
#

you should multithread simplification

primal shadow
#

No time for that, bevy release is very soon

#

Plan is to finish stealing zeux's error projection, steal some of the simplification improvements he had, write a hopefully faster and easier fill cluster buffers improvement, and then maybe fix SW raster if I have time

#

And then write the blog post for everything I did this cycle and help out with the rest of the release

#

Oh btw do you have a gltf -> virtual geometry mesh converter?

#

I'm curious how you're handling materials

fiery bolt
#

i don't have a renderer

#

it's just visbuf output and debug

fiery bolt
wicked notch
#

"forgot you had users" is the most GP thing ever

fiery bolt
#

no no we do it in the rust server too

#

specifically #games-and-graphics and # lang-dev

#

no that's not what i wanted discord

wicked notch
#

discord moment

wicked notch
#

they now belong to GP inc.

#

surrender or else 🔫 🐸

fiery bolt
#

for that you must shill rust cutecatNE

wicked notch
#

the first rule I'll implement is ban rust

fiery bolt
#

along with a healthy dose of offtopic cat gifs

wicked notch
#

that is allowed

fiery bolt
primal shadow
fiery bolt
wicked notch
fiery bolt
#

yes contrib to vcc

#

make it fully usable

wicked notch
#

I have been playing with vcc tho

#

without telling gob

#

I managed to make a transpiler

#

from clang to glsl

fiery bolt
#

i should try to re-derive or figure out the error proj math in my code tomorrow thonk

fiery bolt
#

why would you

wicked notch
#

because I didn't feel like reading the spirv spec

fiery bolt
#

the spirv spec is surprisingly good ngl

wicked notch
#

remember this is my first ever compiler bleaker_kekw

#

I know but I'm a compiler nub

fiery bolt
#

i'm currently on my like

#

6th

#

probably 6th renderer too

#

but i think i passed my final interview for an internship at e🅱️ic despite my serious lack of braincells and knowledge froge

#

just maybe

wicked notch
#

shit

#

we lost one

fiery bolt
#

nono it's for lang dev not giraffics

wicked notch
#

same thing, your soul will be sucked dry in exchange for monetary compensation

fiery bolt
#

only for 3 months

#

if i even get the offer that is

wicked notch
#

you will

#

you have brain

fiery bolt
#

it's been 2 whole days since my interview

#

they've never taken more than a day to respond KEKW

wicked notch
#

terminally online, just like us frfr

fiery bolt
#

fr

#

60 hour work weeks

wicked notch
#

damn

#

aren't there work laws or something where you're from bleaker_kekw

fiery bolt
#

i mean yeah it's biscuits and tea country

#

so it can't be america levels of bad KEKW

#

in return i will get 30% the pay tho

#

(there are several banks offering the same pay for software dev interns as tesco shelf stackers)

#

oh and tesco themselves actually

wicked notch
#

ye but 60hrs work weeks is insane

#

it's half here KEKW

#

(36)

faint crane
#

Work at NVIDIA at that point and take the comp.

primal shadow
#

Ok error projection finished. TODO:

  • Builder improvements from zeux
  • SW raster subpixel precision + top left rule
  • New fill cluster buffers
primal shadow
#

Alright, manual vertex lock time

glass sphinx
#

I want to do some lodding soon. I'll have to catch up here

#

how do you build the lod hierarchy levels?

fiery bolt
#

where nanite slides smh

glass sphinx
fiery bolt
#

meshlets in them should be touching or as close together as possible

glass sphinx
#

I read the nanite slides but i still dont understand the hierarchy building

primal shadow
#

If you use my blog post, I wouldn't copy the code exactly, make sure to reference mine/meshopt's code for the up-to-date changes. I have a new blog post coming in ~2-3 weeks that will be up to date.

glass sphinx
#

is it N tris to N/2 tris?

fiery bolt
fiery bolt
#

and you keep track of error introduced

#

and set that as the parent error for the unsimplified group, and self error for the simplified meshlets

#

which reminds me i should write a blog post on error once i figure it out correctly KEKW

primal shadow
fiery bolt
#

and how i did the bvh

glass sphinx
#

how do you guys do the hierarchical culling? With persistent threads?

fiery bolt
#

nah, i use an indirect dispatch chain rn

#

it's fast enough

#

nanite uses a dispatch chain on pc too

glass sphinx
#

but didnt they say they dont?

fiery bolt
#

that's on console

primal shadow
fiery bolt
#

because it's technically UB

glass sphinx
#

yea but who cares

#

loool

fiery bolt
#

you because you don't wanna debug that shit

#

if even UE uses indirect dispatches, i will too

#

i might try persistent threads after i have streaming working

glass sphinx
#

workgraphs cant come soon enough

fiery bolt
#

or just use workgraphs yeah lol

glass sphinx
#

sad that they have such horrible perf on nvidia

glass sphinx
fiery bolt
#

i wonder if i can abuse DGC to only conditionally insert an indirect dispatch

glass sphinx
#

you can do that with DGC

#

but you still need barriers i think

glass sphinx
#

oh

#

then this is all i want for now i think

primal shadow
#

Try their demo, they have a bunch of configurable stuff

#

Or download bevy and play around with it

fiery bolt
#

if i use coherent writes i don't think i'll need barriers?

glass sphinx
#

oh shit what, meshopt has a nanite demo

fiery bolt
#

yeah it's new

primal shadow
#

I linked it above 😛

glass sphinx
fiery bolt
#

i have an editor too

primal shadow
#

True, Bevy doesn't yet

fiery bolt
#

i need to add a way to actually spawn meshes though lmao

#

you can only select and move things rn KEKW

#

oh and undo

glass sphinx
#

hm the demo doesnt seem to have a cmake target

#

i cant interact with it

wide shadow
glass sphinx
#

i think i might wanna try the hierarchical lodding at some point

#

but i definitely dont want the partial streaming part. Streaming full lods is way less headache i think.

#

i saw some analysis and nanite mesh streaming is a major cause for stutter. Needs much more pcie bandwidth than tex streaming for some reason

fiery bolt
#

nope

primal shadow
glass sphinx
#

rustyyy

#

ok i ll look at both your code's

fiery bolt
wicked notch
#

nah that's actually true

#

but only because UE's streaming code is dogshit

fiery bolt
#

lol

#

i doubt mine will be any better

glass sphinx
#

kinda a shame. Their shader comp stutter is only getting worse with time too

fiery bolt
#

traversal stutter my beloved

#

you can make UE compile shaders on start though, can't you

glass sphinx
#

not really, no

#

they have made it fundamentally kinda impossible with their materials

#

have to pat the back of cod team on this one. I think cod engine is probably one of the least stuttery engines. Everything is prebuilt and static, even the rendergraph.

fiery bolt
#

e🅱️ic should hire us to fix their stutter

glass sphinx
#

they dont care about such things

glass sphinx
#

their priority is best graphics at 30fps with 700p on consoles

#

kekw

wicked notch
#

maybe threat interactive was right all along

fiery bolt
#

should i do streaming or restir or should i fix my shitty code first

wicked notch
#

why not all three

#

at the same time

fiery bolt
#

do i like

#

write a line of code for each

#

then repeat

wicked notch
#

no

#

parallelize

#

what are you unreal engine that does everything on a single thread?

#

grow 4 more hands and buy two more keyboards and mice

fiery bolt
#

oh shit never thought of that

#

lemme try

#

brb

#

help this is me now

wicked notch
#

working as intended

#

keep going

primal shadow
#

I also have a partially written blog post on restir I never finished, I should do that...

frank sail
glass sphinx
frank sail
#

Idk what problem you need to solve

#

I thought you just needed a variable number of indirect dispatches

glass sphinx
#

for hierarchical culling youd ideally just start new threads in the same dispatch

frank sail
#

Ah hierarchy

#

Rip

glass sphinx
#

ah actually now that i think of it

#

Is it possible to start new dispatches immediately?

#

i think it has to flush and then do an execute indirect on the shader recorded command buffer

#

so it will have to run in passes still

wicked notch
#

with DGC you would figure out the deepest level of the hierarchy and create the DGC commands

glass sphinx
#

you only know what meshlets you need after each cull phase

wicked notch
#

no you only need to test the error

#

which admittedly is a lot of work

#

actually yeah dgc doesn't solve the issue KEKW

glass sphinx
#

it would also be very poorly parallel

wicked notch
#

ye

glass sphinx
#

honestly dgc doesnt really do much at all imo

#

i dont really see a use for it in my things

wicked notch
#

DGC is really just a budget vkCmdDispatchIndirectCount

glass sphinx
#

omg yes

#

thats what ive been saying too

#

give me vkCmdDispatchIndirectCount

#

ffs

fiery bolt
frank sail
#

Reading the spec for dgc hurt me ngl

fiery bolt
#

same

#

I tried

#

and failed

wicked notch
#

why it's not that bad

frank sail
#

It reminded me of work graphs how you have to create a bunch of shit first

wicked notch
#

ye you have indirect token layouts and command tokens

frank sail
#

Or what feels like it from reading

wicked notch
#

which is a bit sad, but the layout only says which commands you're gonna use

#

and how many of each of them

glass sphinx
#

noone will use it outside of dxvk

#

hottest take

wicked notch
#

nah

#

mild take

wicked notch
#

only more convoluted KEKW

#

it does have the added benefit of being able to also issue draw commands

#

and pipeline state changes

frank sail
#

Budget vkCmdDrawIndirectCount froge_love

wicked notch
#

maybe this will make UE not issue a drawcall per material

frank sail
#

Press x to doubt

wispy spear
#

vkCmdSetIndirectScissorsCount when

primal shadow
primal shadow
wicked notch
#

issuing a drawcall per material?

fiery bolt
primal shadow
wicked notch
#

radiance cache

#

surely radiance caching is easy

#

:clueless:

fiery bolt
#

how about you catche deez nuts instead

primal shadow
wicked notch
#

ye you do the AMD memes

primal shadow
#

Getting your cache to bend around corners and crap sucks

wicked notch
#

with screen space cache and world space cache

primal shadow
#

While not leaking light

fiery bolt
#

just do whatever ReSTIR does

frank sail
primal shadow
wicked notch
#

I have this as my alibi

fiery bolt
fiery bolt
frank sail
fiery bolt
#

I cooked pasta today

#

did crimes froge_love

#

the spaghet wasn't fitting in my pot so I broke it before putting it in froge_love /s

wicked notch
#

don't mind me

#

I'm not calling in the military

#

you can stay safe in your house

frank sail
#

Lvstri after reading that

wicked notch
#

real

wicked notch
#

avoid looking in your walls too

#

for extra safety

primal shadow
#

Too much triangle cruft at the intersections still. Hopefully retrying stuck clusters in later passes works.

#

1 bunny = 1 meshlet though!

fiery bolt
#

so i seemed to have fixed my occlusion culling

#

well, almost

#

there just seems to be a band of under-culling at the edges for some reason

fiery bolt
#

ok i fixed that but now there's flickering but only for occluded things

#

wtf

faint crane
#

I fixed an occlusion culling bug today where my bounding spheres were wrong since I referenced i instead of j somewhere in a nested loop or something to that effect.

craggy shale
wicked notch
#

yes

#

I have been playing with shady a lot in recent times

craggy shale
#

love to hear it

#

hopefully recent commits didn't fuck your shit up too much

#

it's been refactor time for a while

wicked notch
#

o I haven't pulled in a while

#

yea

craggy shale
wicked notch
#

welp

craggy shale
#

yeah I'm renaming everything, moving files etc

#

sowwy

wicked notch
#

is ok

#

it's also on me for not checking regularly

craggy shale
#

but also having functions with super common name not namespaced whatsoever is a recipe for disaster

#

i'm prefixing everything with shd_ or _shd_

wicked notch
#

that's v nice

craggy shale
#

and shuffling headers around

#

splitting ir.h in smaller more sensible bits

faint crane
#

I have a shader compiler called Slim Shady, but not the real Slim Shady, or death to Slim Shady.

craggy shale
#

the one who says, is

severe dome
#

is that the european version of "whoever smelt it dealt it"?

primal shadow
#

Finally figured out why my renderer broke with 1042 instances

#

I overflowed the 2^25 cluster limit...

#

That was so awful to debug
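For scale: 2^25 is 33,554,432, so with 1042 instances the limit is crossed once the average instance has more than about 32,202 meshlets. A hedged sketch of a guard that would surface this early; the per-instance count of 32,203 used in the check is a made-up number for illustration, not the actual scene.

```rust
/// Cluster IDs are assumed to fit in 25 bits, as stated in the message above.
const MAX_CLUSTERS: u64 = 1 << 25;

/// Total clusters written before LOD selection and culling, or `None` if the
/// total would exceed the limit.
fn total_clusters(meshlets_per_instance: &[u64]) -> Option<u64> {
    let total: u64 = meshlets_per_instance.iter().sum();
    (total <= MAX_CLUSTERS).then_some(total)
}
```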

glass sphinx
#

is that cause you have all cluster instances always present

#

?

#

cause you should lod away most of them right?

#

or is that 2^25 after culling bleakekw

primal shadow
#

It's pre-lod/culling

#

Ideally this becomes part of the culling/lod pass and we only write out data for the meshlets that we intend to raster...

#

I need hierarchical culling first. Also culling is the bottleneck atm, so for that reason too 🙂

glass sphinx
#

do you take contributions?

primal shadow
# glass sphinx do you take contributions?

Sure! I would love help, there's so much to do 😅 . I have a whole github issue on things I need to improve. You're welcome to take up anything. Just talk to me first, so that we're on the same page.

fiery bolt
#

hierarchical culling is great

#

I spend like a constant 0.4 ms on culling iirc

#

completely unoptimized

primal shadow
#

What kind of culling? Multiple dispatches? And what are your inputs/outputs?

fiery bolt
#

bvh frustum + occlusion + lod culling

#

and yeah it's a chain of dispatches

primal shadow
#

E.g. rn I have:

  • Fill cluster buffers: Input list of instances, write out clusters (instance + meshlet IDs)
  • Culling/lod: Input list of clusters, write out visible clusters IDs
fiery bolt
#

ah

#

it's pairs of instance and bvh node IDs yeah

#

and then instance and meshlet IDs to meshlet cull and output from meshlet cull

primal shadow
#

Gotcha gotcha. Thanks.

primal shadow
#

Sigh, maybe it's finally time to try and fix my occlusion culling

#

It's gonna suck to debug though

fiery bolt
#

yeah it does

#

the padding to 64 with granite's HZB seems to work btw

#

still not perfect though

faint crane
#

Or all of us here are cursed with slightly broken occlusion culling forever.

wide shadow
#

Count me in 😔

loud crag
#

my occlusion culling always has the issue of culling small but visible triangles

wide shadow
#

mine functions as frustum culling bleakekw

fiery bolt
#

mine looks like it works but there's a very small border along the bottom and right edges that has less culling for some reason

faint crane
#

Mine is slightly not conservative when I disable Hi-Z.

#

Surely, we'll encounter enough bugs to converge on something which works?

glass sphinx
#

tho that doesnt scale over 4k

fiery bolt
#

but yeah it almost works but not perfectly

primal shadow
#

I'll have to reference your code

glass sphinx
glass sphinx
#

damn i did misinfo

fiery bolt
#

I wonder if the issue is due to me not passing the input though

glass sphinx
#

how does padding the output help

fiery bolt
#

so you have space to store the extra data generated due to NPOT

glass sphinx
#

but why does that work for 64 padd

#

why wont it break on the higher levels

#

those will still have the downrounded div mip sizes
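To make the "downrounded div mip sizes" concrete, here is a sketch that lists a floor-division mip chain and finds the first level with an odd dimension, where a plain 2x2 reduction would skip a row or column. It assumes the usual `max(1, size / 2)` rule; the function names are illustrative.

```rust
/// Mip sizes using floor division, down to 1x1.
fn mip_chain(width: u32, height: u32) -> Vec<(u32, u32)> {
    let mut mips = vec![(width, height)];
    let (mut w, mut h) = (width, height);
    while w > 1 || h > 1 {
        w = (w / 2).max(1);
        h = (h / 2).max(1);
        mips.push((w, h));
    }
    mips
}

/// First level whose width or height is odd (and larger than 1), i.e. the
/// first level whose next mip cannot be covered by non-overlapping 2x2 reads.
fn first_lossy_level(mips: &[(u32, u32)]) -> Option<usize> {
    mips.iter()
        .position(|&(w, h)| (w > 1 && w % 2 == 1) || (h > 1 && h % 2 == 1))
}
```

For 2560x1440 the chain reaches 80x45 at level 5, and the next level is 40x22, so one source row has no 2x2 block of its own.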

#

this gave me an idea tho

fiery bolt
#

there's special handling for higher mips

#

it's goofy

glass sphinx
#

ok

#

rn i just scale the depth image to pot then downsample single pass

#

but it feels dirty

fiery bolt
#

yeah

#

I might just resort to that ngl

glass sphinx
#

it does make everything very simple tho

#

also same speed, just bandwidth limited

primal shadow
glass sphinx
#

it is conservative

#

why wouldnt it be

#

in the downsampling to pot each pixel reads 2x2 pixels of the original image. Read is using uvs so it should map as long as the pot image is not less than half the size

#

now im paranoid

fiery bolt
#

the issue lies in mapping coordinates from the screen to your scaled pot

#

because the mapping differs for each mip

glass sphinx
#

the culling is using the pot image dimensions

#

i just scale 2560x1440 -> 2048x1024 for example with a 2x2 filter for each pixel and then downsample. The culling then uses the pot image dimensions

#

the culling doesnt need to use the screens dimensions. The mapping of pot image mips to original image potential mips doesnt matter after its downscaled

fiery bolt
#

how do you map from screen pixel to hzb pixel?

glass sphinx
#

you mean in the downsampling?

fiery bolt
#

no, culling

glass sphinx
#

i dont have to

#

why would i

fiery bolt
#

how do you take NDC AABB and sample from your hzb

glass sphinx
#

the culling uses the hzb dimensions

#

i calculate a ndc for hzb pot size
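A rough sketch of that mapping, assuming a power-of-two HZB that covers the whole screen: convert the NDC rect to UVs, measure it in HZB texels, and pick the mip where it spans at most one texel per axis (so it can overlap at most 2x2 texels). `ndc_to_uv` and `hzb_mip_for_rect` are hypothetical names, and any Y flip your API needs is left out.

```rust
/// Map NDC in [-1, 1] to UV in [0, 1]. Any API-specific Y flip is omitted.
fn ndc_to_uv(ndc: [f32; 2]) -> [f32; 2] {
    [ndc[0] * 0.5 + 0.5, ndc[1] * 0.5 + 0.5]
}

/// Pick the HZB mip at which the rect spans at most one texel per axis.
/// `hzb_size` is the size of mip 0 and is assumed to be a power of two.
fn hzb_mip_for_rect(uv_min: [f32; 2], uv_max: [f32; 2], hzb_size: [u32; 2], mip_count: u32) -> u32 {
    let w = (uv_max[0] - uv_min[0]) * hzb_size[0] as f32;
    let h = (uv_max[1] - uv_min[1]) * hzb_size[1] as f32;
    let texels = (w.max(h).ceil() as u32).max(1);
    // ceil(log2(texels)) using integer math.
    let level = texels.next_power_of_two().trailing_zeros();
    level.min(mip_count - 1)
}
```

Because only UVs and the HZB's own dimensions appear here, the screen resolution never enters the culling math, which is the point being argued above.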

fiery bolt
#

oh you're... scaling down NPOT?

glass sphinx
#

?

fiery bolt
#

"i just scale 2560x1440 -> 2048x1024 for example with a 2x2 filter for each pixel"

#

this is definitely wrong

glass sphinx
#

what, how?

glass sphinx
#

it doesnt matter at all what dim the culling tex has

fiery bolt
#

yeah if your hzb is scaled so that the entire hzb maps to the entire screen

#

but how you do the mapping seems very wrong

#

do you have an overdraw debug view?

glass sphinx
#

you seem to have a very flawed image of how i downsample

#

reading 2x2 doesnt mean there is an offset of 2 pixels for every out pixel

fiery bolt
#

then what does it mean

#

do you just throw a min sampler and sample from UVs

glass sphinx
#

you calculate the uv in the dst image, then gather in the original image and max/min (depending on depth dir) them all
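For reference, a CPU-side sketch of a resize that is conservative by construction. Note this is not the 2x2 gather described above: it reduces over every source texel that a destination texel's footprint touches, which sidesteps the question of whether a single gather always covers the footprint. All names are illustrative, and the `reduce` argument stands in for the max/min choice that depends on depth direction.

```rust
/// Inclusive range of source texels overlapped by destination texel `i`.
fn footprint(i: usize, src: usize, dst: usize) -> (usize, usize) {
    let first = i * src / dst;
    let last = ((i + 1) * src + dst - 1) / dst - 1;
    (first, last.min(src - 1))
}

/// Resize a depth image, reducing over the full footprint of each output texel.
/// Pass `f32::max` or `f32::min` depending on depth direction.
fn downsample_conservative(
    src: &[f32],
    src_size: (usize, usize),
    dst_size: (usize, usize),
    reduce: fn(f32, f32) -> f32,
) -> Vec<f32> {
    let (sw, sh) = src_size;
    let (dw, dh) = dst_size;
    let mut dst = Vec::with_capacity(dw * dh);
    for y in 0..dh {
        let (y0, y1) = footprint(y, sh, dh);
        for x in 0..dw {
            let (x0, x1) = footprint(x, sw, dw);
            let mut acc = src[y0 * sw + x0];
            for sy in y0..=y1 {
                for sx in x0..=x1 {
                    acc = reduce(acc, src[sy * sw + sx]);
                }
            }
            dst.push(acc);
        }
    }
    dst
}
```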

#

i have made many debug visualizations and tested many cases and the culling never breaks from what i can tell

glass sphinx
#

i also had a visualization that draws the ndc for each culled object to see if its overculling (as visible ndc would mean it culled something thats actually visible)

#

but its not much to show as you just see no ndc 😮

#

this is culling off

fiery bolt
#

hmmmm

glass sphinx
#

was a massive help to fix it initially

fiery bolt
glass sphinx
#

xD

fiery bolt
#

I should do some stats though thonk

glass sphinx
#

the debug utils in tido make up like 10-15% of all its code or so. But its so nice to have that stuff

#

turbo bikeshed

#

but also sanity

fiery bolt
#

debug utils are insanely helpful

glass sphinx
fiery bolt
#

like, one of the most important things when debugging

glass sphinx
#

i ll polish up the debug draws for culling and show them later

fiery bolt
glass sphinx
#

ooooh

fiery bolt
#

idk why

#

still tryna fix that

#

the weird border and flickering are the two bugs left

#

works great otherwise

#

oh yeah also sw raster

#

just a wee bit broken

#

in that it instacrashes when enabled

glass sphinx
#

i think it took me a few months of random insights to fully fix everything

#

i had a few very hard thinking mistakes

#

im stopping myself from doing sw raster

#

too much bikeshed went into the rasterization

#

its time for cool visuals now

wicked notch
fiery bolt
#

no

#

i build entire nanite

#

every component is slightly broken

#

then i try to debug

#

oh also i got the e🅱️ic internship froge_love

#

time to slave writing code for money but full time now

wicked notch
#

pog

#

send some here to finance my stupid decisions (like buying the upcoming intel cpus)

fiery bolt
#

they're gonna burn themselves up

#

while being slower than the 14900k

wicked notch
#

intel pinky promises that these ones are safe

fiery bolt
wicked notch
#

I didn't preorder or anything just to be safe

#

I'll wait for buildzoid to post his usual rants

fiery bolt
#

lol

#

you should give amd your money instead

wicked notch
#

I considered that

#

then I looked that they regressed on memory latency

fiery bolt
#

buy zen 4 tho

wicked notch
#

which was already abysmal

fiery bolt
#

zen 5 is just not worth it

wicked notch
#

ye

fiery bolt
#

buy a 7950x3d froge_love

wicked notch
#

zen 5 is a mess

fiery bolt
#

or an 8950x3d if it has cache on both ccxs

wicked notch
#

I mean

#

have you seen the core to core latency

#

on zen 5 it's sometimes faster to read from ram than from cache (in another ccd)

fiery bolt
#

lmao wtf

#

is this after the agesa patch they did

wicked notch
#

ye there was a graph somewhere in #hardware that was absolutely funny

fiery bolt
#

i think that was before the patch

wicked notch
#

turns out T_cache + T_fabric >= T_ram KEKW

wicked notch
#

lemme find some info

primal shadow
#

@fiery bolt for the BVH, what do BVH nodes equal? Some kind of grouping of clusters? Or of cluster groups? The nanite slides are vague on how the BVH is set up. Also how they enforce only 8 clusters per node.

fiery bolt
#

seems to work well enough

#

might wanna read the code tbh

#

and ask questions based on that

primal shadow
#

My software raster is broken and idk how to debug it 😭

fiery bolt
#

literally me

#

but mine crashes the GPU and I have no idea why

wispy spear
#

can you use novideo aftermath?

fiery bolt
#

and only when I touch groupshared mem

#

which is agonyfrog

wispy spear
#

did you ask in the more public channels already?

primal shadow
wispy spear
#

but these meshlets look neat and make me jealous

primal shadow
wispy spear
#

i am mentally not capable yet unironically

fiery bolt
#

i have successfully completely broken my error projection froge_love

#

i'm also somehow crashing with an MMU fault when i shrimply index my output storage image with SV_Position.xy

#

how does that even happen

wicked notch
#

bro is finding bugs not even the hw knew it had

fiery bolt
#

fr

#

that can't be real

#

aftermath must be tripping

primal shadow
#

Whooh, fixed my SW rasterizer!

#

Had to force the HW rasterizer for near-clipped clusters, and add backface culling to the SW rasterizer
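As an illustration of the backface part of that fix, here is a minimal sketch of the usual signed-area test a software rasterizer can run before scanning a triangle. It assumes counter-clockwise triangles are front-facing in a y-up screen space; flip the comparison for the opposite convention. This is not the Bevy code, just the general technique.

```rust
/// Twice the signed area of a screen-space triangle (the edge function).
fn signed_area_2x(a: [f32; 2], b: [f32; 2], c: [f32; 2]) -> f32 {
    (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
}

/// Assumption: counter-clockwise is front-facing. Backfacing and
/// zero-area triangles are both rejected.
fn should_rasterize(a: [f32; 2], b: [f32; 2], c: [f32; 2]) -> bool {
    signed_area_2x(a, b, c) > 0.0
}
```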

wide shadow
wicked notch
#

zeux massive as usual

faint crane
primal shadow
#

Probably the roadmap actually, software VRS is something I want to experiment with

primal shadow
#

I updated to meshopt 0.22 and factored normal error into the LOD selection, so much better now!

wicked notch
#

Here's a fun fact from my days as a professor's assistant

#

so I was correcting the practical exams of second year students for DSA

#

the tasks were

  1. Create a graph data structure, load the nodes from file and implement both DFS and BFS visits
  2. Write an algorithm to find the longest cycle in the graph
  3. Write an algorithm that determines whether there is a Hamiltonian cycle in the graph
#

now here's the funny part, the third task is NP complete but the professor somehow missed it KEKW

#

most of the students were able to just write the bruteforce algorithm with backtracking, however our uni's computers are extremely outdated and an n factorial algorithm isn't exactly the fastest
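For the curious, the brute-force-with-backtracking answer to task 3 looks roughly like this. It is a sketch over an undirected adjacency matrix, with `from_edges` as a small helper invented for the example, and its worst case is factorial in the number of vertices.

```rust
/// Build an undirected adjacency matrix from an edge list.
fn from_edges(n: usize, edges: &[(usize, usize)]) -> Vec<Vec<bool>> {
    let mut adj = vec![vec![false; n]; n];
    for &(a, b) in edges {
        adj[a][b] = true;
        adj[b][a] = true;
    }
    adj
}

/// Try to extend the current path to a full cycle, undoing each failed choice.
fn extend(adj: &[Vec<bool>], path: &mut Vec<usize>, visited: &mut [bool]) -> bool {
    let n = adj.len();
    let last = *path.last().unwrap();
    if path.len() == n {
        // Every vertex is used once; the cycle closes if we can return to the start.
        return adj[last][path[0]];
    }
    for next in 0..n {
        if !visited[next] && adj[last][next] {
            visited[next] = true;
            path.push(next);
            if extend(adj, path, visited) {
                return true;
            }
            path.pop();
            visited[next] = false;
        }
    }
    false
}

fn has_hamiltonian_cycle(adj: &[Vec<bool>]) -> bool {
    let n = adj.len();
    if n < 3 {
        return false;
    }
    let mut visited = vec![false; n];
    visited[0] = true;
    extend(adj, &mut vec![0], &mut visited)
}
```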

frank sail
#

I was asked multiple times to solve np hard problems in college. A lot of them are easy, you just gotta brute force it

wicked notch
#

real

#

anyway this wasn't supposed to happen but oh well

loud crag
#

funny words

fiery bolt
#

our uni computers have 7900xs... but the intel kind, and they just upgraded them all from 3080s to 4070 ti supers????

#

like pls can we get new CPUs too

minor root
#

(It was an intentional trick question but still lmao)

wide shadow
wispy spear
#

needs a social preview image 😛 (its in the repo -> settings -> "social preview")

fiery bolt
#

i replaced shrimple 1 buffer -> 1 indirect queue with a complex dequeue and that fixed my software rasterizer?

wispy spear
#

nice

#

time to blog about it : >

fiery bolt
ebon ruin
#

Okay Nanite general

#

how come Bevy uses sphere bounds for cluster frustum culling?

#

It already has AABBs that it uses for occlusion culling

#

is the perf difference that big?

fiery bolt
#

AABBs are usually better than spheres

#

but you need a sphere for error projection bounds

#

I use an AABB for frustum and occ cull, and spheres for error
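A sketch of why the error test wants a sphere: one common approximation projects the simplification error to pixels using the distance from the camera to the nearest point of the LOD bounding sphere. The formula and parameter names here are assumptions for illustration (`proj_y_scale` is meant as the projection matrix's Y scale, i.e. cot(fov_y / 2)), not a quote of anyone's shader.

```rust
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

/// Approximate on-screen size (in pixels) of a world-space error, measured at
/// the closest point of the bounding sphere to the camera.
fn projected_error_px(
    error: f32,
    bounds: &Sphere,
    camera_pos: [f32; 3],
    proj_y_scale: f32,
    screen_height: f32,
    z_near: f32,
) -> f32 {
    let dx = bounds.center[0] - camera_pos[0];
    let dy = bounds.center[1] - camera_pos[1];
    let dz = bounds.center[2] - camera_pos[2];
    let dist = (dx * dx + dy * dy + dz * dz).sqrt();
    // Clamp so a camera inside the sphere gets a large but finite error.
    let d = (dist - bounds.radius).max(z_near);
    error / d * (proj_y_scale * 0.5 * screen_height)
}
```

The same distance-minus-radius step has no equally cheap, rotation-independent form for an AABB, which is one reason to keep a separate sphere just for error.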

glass sphinx
#

spheres are faster for testing frustum

fiery bolt
#

yes but AABBs usually lead to better culling efficiency
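To show both sides of this trade-off, here is a small sketch of the two plane tests. The sphere test is one dot product and a compare; the AABB test adds an absolute-value dot product for the projected radius but rejects more for elongated bounds. The plane convention (a point is inside when `dot(n, p) + d >= 0`) is an assumption of the sketch.

```rust
#[derive(Clone, Copy)]
struct Plane {
    normal: [f32; 3], // unit length, pointing into the frustum
    d: f32,
}

fn dot(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Sphere is fully outside the plane.
fn sphere_outside(p: Plane, center: [f32; 3], radius: f32) -> bool {
    dot(p.normal, center) + p.d < -radius
}

/// AABB (center + half extents) is fully outside the plane.
fn aabb_outside(p: Plane, center: [f32; 3], half_extent: [f32; 3]) -> bool {
    let abs_n = [p.normal[0].abs(), p.normal[1].abs(), p.normal[2].abs()];
    // Radius of the box as seen along the plane normal.
    let r = dot(abs_n, half_extent);
    dot(p.normal, center) + p.d < -r
}
```

With a plane at x = 0 and a tall thin box centered at x = -2 with half extents (1, 10, 1), the AABB test culls the box, while a sphere enclosing the same box (radius about 10.1) still crosses the plane and survives.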

#

and that's worth it

glass sphinx
#

im not convinced it is

#

show code, maybe you have some clever way to make it faster than what i did

#

From what ive seen so far, the culling gains from it do not justify the extra overhead for higher meshlet counts

delicate rain
#

We need obbs

#

Ups not Tido thread

fiery bolt
#

do note that I have BVH culling, not just instance -> meshlet

wispy spear
#

cute pfp : )

glass sphinx
#

@wicked notch i reimplemented prefix sum + binary search now with devshs trick
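The general shape of the prefix sum + binary search approach, as a CPU sketch: sum the per-bucket work counts, launch one flat range of threads, and have each thread binary-search the prefix array to find its bucket and local index. The details of the trick mentioned above are not described in the thread, so this only shows the baseline technique, with made-up names.

```rust
/// Inclusive prefix sum of per-bucket work item counts.
fn inclusive_prefix_sum(counts: &[u32]) -> Vec<u32> {
    counts
        .iter()
        .scan(0u32, |sum, &c| {
            *sum += c;
            Some(*sum)
        })
        .collect()
}

/// Map a flat thread index to (bucket, index within bucket).
/// Returns None when the index is past the total amount of work.
fn locate(prefix: &[u32], flat_index: u32) -> Option<(usize, u32)> {
    // First bucket whose inclusive end is greater than the flat index.
    let bucket = prefix.partition_point(|&end| end <= flat_index);
    if bucket == prefix.len() {
        return None;
    }
    let start = if bucket == 0 { 0 } else { prefix[bucket - 1] };
    Some((bucket, flat_index - start))
}
```

Empty buckets are skipped for free, and everything runs in a single dispatch instead of one dispatch per power-of-two bucket.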

#

its much faster than po2 buffers

wicked notch
#

damn

glass sphinx
#

the overhead of many dispatches is massive on nvidia

wicked notch
#

that's op

glass sphinx
#

honestly crazy to me

#

devsh made the point to me that the draw order is fucked with po2

wicked notch
#

should we report that though

glass sphinx
#

i think thats a big reason

#

also

wicked notch
#

idk it feels like a driver issue

glass sphinx
#

idk their frontend always was bad

#

also im not sure but my new binary search is much faster than my old one

#

idk why i did my old one so badly

#

now my mesh shaders reach much better occupancy

#

i have the suspicion that multiple mesh shader dispatches have high overhead and cant share resources well

#

the isbe memory might be contested between many dispatches

wicked notch
#

huge findings

#

I'll do binary search too then

glass sphinx
#

i give you the code

#

this will make vsms much much faster i think

#

nuking their overhead

#

they do 32x16 dispatches atm

#

that will go down to just 16

#

but this means that mesh shader with task shaders have much larger overhead for launches

#

than normal draws

#

but still way better than compute

#

kinda a middle child

#

i can kinda see that the dispatches start after each other

#

the later buckets are emptier

#

and with bucket launch the later part is just kinda empty

primal shadow
fiery bolt
#

"RenderDoc does not actually show results from the GPU, it's all simulated on the CPU"
pretty sure it replays commands, no?

#

if it was all run on the CPU it would be incredibly slow

#

and it also dies if i device lost in the capture

frank sail
#

it uses transform feedback to get vs output, for example

#

I think the only thing it simulates on the cpu is shader debugging

fiery bolt
#

yeah

#

and shader debugging kinda sucks as a result

#

because other waves and workgroup threads are shrimply not simulated

primal shadow
#

Let me remove that part then, thanks. Not sure why my output was changing every time I clicked the dispatch then.

#

I guess even if it was running on the GPU, renderdoc kept re-simulating it

fiery bolt
#

yeah it probably replays the entire command stream when you go backwards

#

because making a copy of each buffer for each command would eat up a lot of mem

#

looks good otherwise though froge_love

frank sail
#

there are many things that can trigger renderdoc to replay the frame

primal shadow
#

New text:

Debugging the issue was complicated by the fact that the rewritten fill cluster buffers code is no longer deterministic. Clusters get written in different orders depending on how the scheduler schedules workgroups, and the order of the atomic writes. That meant that every time I clicked on a pass in RenderDoc to check its output, the output order would completely change as RenderDoc replayed the entire command stream up until that point.
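A CPU analogy of that behaviour, using threads in place of workgroups: each "workgroup" reserves a range in the output with an atomic add and writes its clusters there. The set of clusters is always the same, but the order depends on which thread gets to the atomic first. The packing of instance and meshlet IDs into a `u64` is an assumption made for the sketch.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

/// One thread per instance; each reserves space with an atomic add and then
/// writes (instance_id << 32 | meshlet_id) for every meshlet it owns.
fn fill_cluster_buffer(meshlet_counts: &[u32]) -> Vec<u64> {
    let total: usize = meshlet_counts.iter().map(|&c| c as usize).sum();
    let out: Vec<AtomicU64> = (0..total).map(|_| AtomicU64::new(0)).collect();
    let cursor = AtomicUsize::new(0);
    thread::scope(|s| {
        for (instance, &count) in meshlet_counts.iter().enumerate() {
            let (out, cursor) = (&out, &cursor);
            s.spawn(move || {
                // Which range a thread gets depends on scheduling.
                let base = cursor.fetch_add(count as usize, Ordering::Relaxed);
                for m in 0..count {
                    let packed = ((instance as u64) << 32) | m as u64;
                    out[base + m as usize].store(packed, Ordering::Relaxed);
                }
            });
        }
    });
    out.into_iter().map(|a| a.into_inner()).collect()
}
```

When comparing captures of a buffer like this, sorting it first (or comparing as a set) removes the ordering noise.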

frank sail
#

like clicking on a different event

primal shadow
fiery bolt
#

that's some major improvement with sw raster

#

what's the test scene?

fiery bolt
#

you should try the lucy scan

primal shadow
#

Unfortunately v0.14 can't render higher counts due to how I handle allocating some buffers

#

It runs OOM

#

So I need a test scene I can use on both

fiery bolt
#

oh rip

primal shadow
#

I'm going to use the megascan cliffs and show perf for that too in 0.15

fiery bolt
#

lucy is also great for testing out import perf and how good your simplifier is

#

because there's just so many tris

#

28 million iirc

primal shadow
#

Maybe for 0.16 😛

#

I don't have it setup

fiery bolt
#

lol

#

you'll need to multithread generation too

#

it takes me 18 minutes to import on an (amd) 7900x bleakekw

#

with all cores being hammered throughout

primal shadow
#

Ah. I have a 2600...

#

So I'll try that in 0.16 😛

fiery bolt
#

probably a good idea yeah

wicked notch
#

boys

#

we did it

#

time to party

wide shadow
#

College is over?

wicked notch
#

college part 1

#

part 2 is the real stuff

wide shadow
#

Don't say masters bleaker_kekw

wicked notch
#

yep

frank sail
#

let's go

#

I think I had a reminder set for this day

#

where are you doing your masters?

wide shadow
#

How much of free time until you start masters

wicked notch
wide shadow
#

Damn that's quite a lot

wicked notch
#

ye I've decided to take some time off to avoid actually dying KEKW

cunning solstice
#

year is actually quite short

#

I blink and it's over

wide shadow
#

Don't ever start on that in 5ish months I have school leaving exams

glass sphinx
#

very cool

#

you studied agriculture and animal welfare?

wide shadow
#

Basically first and second semester of college

#

You get all the jazz about CPUs, how memory works, electrical engineering, programming, databases and operating systems

glass sphinx
#

i was betting on lvstri to care for my cows

#

fyck

fiery bolt
wicked notch
#

what is the nanite uni

loud crag
#

#bikeshed-😇

glass sphinx
#

lmao

fiery bolt
primal shadow
#

Blog post for bevy 0.15 meshlet stuff is almost done

#

I kinda wish I scrapped the idea and had just done a blog post or two on some specific parts, instead of everything. It's kinda a big mess of a post, but too late to change now...

#

I think the memory compression section came out well, but not so much the rest

primal shadow
#

@fiery bolt BVH for nanite = internal nodes point to cluster groups, and use AABBs based on cluster LOD spheres, and then leaf nodes point to clusters? Or is that wrong?

primal shadow
wispy spear
#

@wicked notch congratulazzione

fiery bolt
#

so you're gonna be expanding a lot more nodes

primal shadow
#

"then build a BVH of the root nodes (no SAH because the AABBs are gonna be the same)"
What does this mean? Literally just take 8 random nodes, group, and repeat until you have a single root node?

#

I.e. if you have 16 LODs, pick 2 sets of 8 randomly to group, and then group the two sets once more

fiery bolt
#

sorting by error is probably a better idea now that I think about it lol
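A sketch of that grouping, under the assumption that each per-LOD root is represented only by its error: sort by error, group `width` nodes at a time with the parent taking the max error of its children, and repeat until a single root remains. `group_into_bvh` is a hypothetical name, and real nodes would also carry bounds.

```rust
/// Returns the error values of every level, from the sorted inputs up to the root.
fn group_into_bvh(mut errors: Vec<f32>, width: usize) -> Vec<Vec<f32>> {
    assert!(width >= 2, "a node must hold at least two children");
    errors.sort_by(|a, b| a.total_cmp(b));
    let mut levels = vec![errors];
    while levels.last().unwrap().len() > 1 {
        let next: Vec<f32> = levels
            .last()
            .unwrap()
            .chunks(width)
            // A parent must be at least as "wrong" as its worst child.
            .map(|chunk| chunk.iter().copied().fold(f32::MIN, f32::max))
            .collect();
        levels.push(next);
    }
    levels
}
```

With 16 LOD roots and a width of 8 this gives levels of 16, 2 and 1 nodes, matching the "2 sets of 8, then group the two sets" description.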

primal shadow
fiery bolt
#

yeah tru

#

even lucy only has 13 LODs

#

so we don't really add too many levels

primal shadow
#

I made this diagram to explain how meshopt's LOCK_BORDERS flag works (#2)

#

And then I realized I have no idea how it works

#

So guess I'm not using it and just gonna skip explaining it lol

#

No idea how it preserves the meshlet borders if it's just going off of the topological border

primal shadow
#

I have 0 motivation to do BVH culling after I spent so much time on virtual geometry for bevy 0.15 😬

#

Guess I need to take a break

primal shadow
#

@fiery bolt for your BVH, for interior nodes, what bounding sphere do you use to project the error?

#

Your leaf nodes are cluster groups, with error = parent group error, and bounding sphere = parent group bounding sphere

#

And when building an interior node over those leaf nodes, you set error = max error of leaf nodes

#

But what do you set the bounding sphere to be?

#

A new bounding sphere enclosing all the leaf node bounding spheres?

#

(btw it's confusing because you use the DAG parent group LOD data, which is different than the BVH parent lol)

fiery bolt
#

it's just all the child BVH nodes' lod spheres merged
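A minimal sketch of building an interior node that way: fold the children's LOD spheres into one enclosing sphere and take the max of their errors. The pairwise merge below is conservative but not the minimal enclosing sphere, and the type and function names are invented for the example.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

/// Smallest sphere enclosing two spheres.
fn merge(a: Sphere, b: Sphere) -> Sphere {
    let d = [
        b.center[0] - a.center[0],
        b.center[1] - a.center[1],
        b.center[2] - a.center[2],
    ];
    let dist = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt();
    if dist + b.radius <= a.radius {
        return a; // b is already inside a
    }
    if dist + a.radius <= b.radius {
        return b; // a is already inside b
    }
    let radius = (dist + a.radius + b.radius) * 0.5;
    let t = (radius - a.radius) / dist;
    Sphere {
        center: [a.center[0] + d[0] * t, a.center[1] + d[1] * t, a.center[2] + d[2] * t],
        radius,
    }
}

/// Interior node data from (lod_sphere, error) pairs of its children.
fn interior_node(children: &[(Sphere, f32)]) -> Option<(Sphere, f32)> {
    let (first, rest) = children.split_first()?;
    Some(rest.iter().fold(*first, |(s, e), &(cs, ce)| (merge(s, cs), e.max(ce))))
}
```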

primal shadow
#

Thanks! This shouldn't be too bad to implement then. Just very confusing, because there's both DAG parents and BVH parents 😅

primal shadow
#
struct BvhNode {
    child_start_id: u32, // If meshlet, is meshlet ID, else is pointer to BvhNode
    child_count: u16, // If u16::MAX, then node is a single meshlet, else is BvhNode child count
    error: f16, // If meshlet then is group_error, else if is lod group then is parent_group_error, else is max of child's parent_group_errors
    bounding_sphere: vec4<f32>, // If meshlet then is group_bounding_sphere, else if is lod group is parent_group_bounding_sphere, else is new bounding sphere enclosing all children bounding spheres
}

😅

fiery bolt
#

what I do is SOAify my BVH into a BVH8, so a u8::max is a single node child, otherwise it's meshlet count

#

('single node' => actually holds data for 8 nodes)

#

this also reduces queue memory size by 8x

#

probably the main reason I did it tbh

primal shadow
fiery bolt
#

not entirely sure how that would work

primal shadow
#

Each meshlet needs a bounding sphere and error anyways

fiery bolt
#

but also a tight cull sphere and other metadata

primal shadow
#

Because after the lod group check against parent error, you need to check the meshlets self error

primal shadow
fiery bolt
#

ah ok

#

if it works, it works lol

primal shadow
#

Mhm

glass sphinx
#

btw what is bvh culling?

#

isnt the dag already a tree that can be used to cull?

primal shadow
glass sphinx
#

hm

fiery bolt
#

they reconverge

#

so DAG traversal is more complex, as you have to ensure you don't revisit things

#

since we can already do LOD selection in parallel, we don't really have to follow the DAG, we can use any structure to accelerate it

#

thus, the BVH

wicked notch
#

graph is hard to visit

#

bvh is ez

fiery bolt
primal shadow
#

@fiery bolt are you doing frustum + occlusion culling in the same kernel as the LOD traversal, or do you do hierarchical LOD traversal, write all the meshlets to a buffer, and then do culling?

fiery bolt
glass sphinx
#

im confused

#

you have to visit the dag anyway as you instantiate meshlets, no?

#

what is the dag for at all if its not used?

fiery bolt
#

the DAG exists during build

#

but you don't have to visit it, since LOD decision can be localized if you store current and parent error
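The localized decision can be sketched like this, assuming errors are already projected to pixels and grow monotonically toward coarser LODs: a cluster is drawn when its own error is acceptable but its parent group's error is not. The names and the pixel threshold are assumptions for illustration.

```rust
/// A cluster is on the LOD cut when it is good enough but its parent is not.
fn is_selected(self_error_px: f32, parent_error_px: f32, threshold_px: f32) -> bool {
    self_error_px <= threshold_px && parent_error_px > threshold_px
}

/// Indices of the selected entries in a list of (self_error, parent_error) pairs.
/// Each entry is tested independently, with no traversal between them.
fn selected_lods(lods: &[(f32, f32)], threshold_px: f32) -> Vec<usize> {
    let mut picked = Vec::new();
    for (i, &(self_err, parent_err)) in lods.iter().enumerate() {
        if is_selected(self_err, parent_err, threshold_px) {
            picked.push(i);
        }
    }
    picked
}
```

Because every entry only looks at its own two numbers, the test can run in parallel over any container, which is what frees the runtime from following the DAG.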

#

so now you can just build a BVH out of parent error to accelerate stuff

glass sphinx
#

hmmmm

#

i see

#

but arent the parents spawning the childrens work?

fiery bolt
#

yeah, from the BVH

glass sphinx
#

i dont get it

#

its not a bvh tho its a dag

#

like what does the dag do then

#

at build time

fiery bolt
#

you're converting a DAG to a BVH

#

at build time

glass sphinx
#

is that possible

#

huh

#

i dont have intuition for that at all

fiery bolt
#

the DAG isn't really important tbh

#

it doesn't really 'exist' at build time either

#

there's no data structure explicitly storing it

glass sphinx
#

i see

fiery bolt
#

it's just implicitly there with how groups relate to each other

#

but since you only care about the current group and parent group

#

you can BVH-ify it

glass sphinx
#

ok wait

primal shadow
fiery bolt
#

AABB

#

but yeah

#
public struct BvhNode {
    public Aabb aabbs[8];
    public f32x4 lod_bounds[8];
    public f32 parent_errors[8];
    public u32 child_offsets[8];
    public u8 child_counts[8];
}
glass sphinx
#

@fiery bolt when a child finds out its parent is instantiated it kills itself?

fiery bolt
#

no

#

the parent doesn't spawn the child

primal shadow
fiery bolt
#

SOA

glass sphinx
fiery bolt
#

so i can have one index in the queue for 8 nodes
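A rough sketch of consuming one such queue entry, based only on what is said in this thread: a child count of `u8::MAX` means the slot points at another BVH8 node, anything else is a meshlet count. Treating a count of 0 as an unused slot is an assumption of the sketch, and the AABB, LOD bound, and error checks that would gate each child are left out.

```rust
const NODE_CHILD: u8 = u8::MAX;

struct Bvh8Node {
    child_offsets: [u32; 8],
    child_counts: [u8; 8],
}

#[derive(Debug, PartialEq)]
enum Work {
    /// Index of another BVH8 node to push onto the node queue.
    Node(u32),
    /// Contiguous range of meshlets to hand to meshlet culling.
    Meshlets { first: u32, count: u8 },
}

/// Expand one queued node index into work for its (up to) 8 children.
fn expand(node: &Bvh8Node) -> Vec<Work> {
    (0..8)
        .filter_map(|i| match node.child_counts[i] {
            0 => None, // assumption: 0 marks an unused slot
            NODE_CHILD => Some(Work::Node(node.child_offsets[i])),
            count => Some(Work::Meshlets { first: node.child_offsets[i], count }),
        })
        .collect()
}
```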

primal shadow
#

I see

fiery bolt
#

nanite also does a BVH8 iirc

primal shadow
#

So you basically just build a BVH, and then inline the 8 children into each node?

#

and then the children start/end become what?

glass sphinx
fiery bolt
#

count only matters for meshlets

fiery bolt
#

if you just expand to all meshlets

glass sphinx
#

no

fiery bolt
#

read jasmine's first blog froge_love

#

then ping me and i'll tell you how the BVH works