#Foundation - My adventure through Graphics Programming
I use task graph for really long time
nice
the "concurrent" use enums for it are already added
they also work already if you want to overlap tasks that use the same resource
nice. I have sw raster so async compute for it would be nice
finally
my texture streaming is now instant
with 32 threads it should be XD
im choping
I went for 7950x because it had nice discount like most things that I got
9800x3d looking really nice but I want 32 threads with dual cache
this upgrade is insane 7700HQ -> 7950X and 1050M -> 4080 super
then you should have gone for some epyc
Btw how many total additional shit did you buy before it worked?
....
Its a serious question, deccer
I'm finally upgrading from a 2600 to a 5800x, if I can get a used 5800x cheap off ebay
nice
it was dead cpu
i got my 5950x off ebay too, with mobo cooler and 32gb ram, for 320 european pesos, but i got a 64gb kit new
ah so 2x 7950x
yeah
so in total 2x 7950x, 3x motherboard and 5x ram kits

summer job money coming in clutch
also my microphone is waiting for me
What did you do? Built custom pc's from spare parts?
no a lot worse
php dev
not really. I was doing a lot of plumbing for their mono repo
docker, configs, databases and php + symfony bullshit
Also I learned to use mysql to code gen sql for me 
How did you get the job? Looking at local job market and its just destroyed(
first job I got through a friend who is a designer. He worked for a company and they needed a programmer, so I was hired on the spot basically. second job I got because my high school has mandatory internships in 3rd and 4th grade. I got there through a friend, they had never done internships before. After 14 days he sat me down and asked a bunch of questions, that's where he learnt that I am a high schooler, and I got a job there
They are expecting me after I graduate HS
before that I worked from my 15 years bunch of other jobs
I am lucky bastard
first job my github helped a bunch
Yeah, just wanted to note that
I mean its not just a luck but a lot of work also
In my high school there is a day of companies (rough translation) where like 60 of them come to our school and try to recruit us
My college doesnt offer internships) Hopefully uni will offer something
same here. We look for them
They say you have to have internship and dont offer some. So you gotta look on internet
So if you dont have one they just kick you out?
got even clang to compile my project 
mesh shader time
king time
somebody whined about mesh shaders being slower than ordinary shit the other day
where?
i think you might confuse them with task shaders?
yeah
for sure
If I remember correctly you moved shit from task to compute?
I want to try implement my own work graphs
vk spec my beloved
i have both
i can switch
i am trying to assess work expansion perf still
oh boy thats a lot of bikeshed
yea it is
people in academic should be banned from writing code
because oh god that shit is unreadable
I found some DGC code. Bit disappointed, since you can't really execute commands in the same way as work graphs. It's more or less an indirect dispatch/draw call but with more flexibility
VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT
what the actual fuck
They removed support in newer driver version? 
or is this beta
anyway
afaik this driver is a rollback due to a security bug
see the vulkan version decreasing as well
Oh completely forgot about that
fastest running windows server 2008 R2 
why
For school 
Also I am forced to write documentation
I dont want to talk how we were forced to use MS DOS
why? are you studying archaeology?
sadly I am studying HS computer science
horse science
FOR THE EMPEROR
why 80% of time is spent in resolve vis buffer shader 
Found the culprit: stupid atomic ops
const f32 mip_map = log2(max(length(uv_grad.ddx * 65536u), length(uv_grad.ddy * 65536u)));
const u32 wave_mask = WaveActiveBitOr(65536u >> u32(max(0, mip_map)));
if(WaveIsFirstLane()) {
InterlockedOr(push.uses.u_readback_material[mesh.material_index], wave_mask);
}
What's with the wave ops?
what do you mean exactly?
What is this pass doing?
its resolving vis buffer to color image + writes which texture size is needed for material
Oh, is that for your texture streaming?
yep
Makes sense now
the WaveIsFirstLane should reduce the amount of atomics to just one per subgroup if I am correct
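A CPU-side sketch of why the wave ops cut down the atomics, assuming a 32-lane wave and modelling `WaveActiveBitOr` as a plain loop. `kWaveSize`, `mip_mask` and `resolve_wave` are made-up names for illustration, not daxa or Slang API, and the clamp to 31 is an addition to keep the shift defined in C++.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed wave width; real hardware reports its own subgroup size.
constexpr int kWaveSize = 32;

// Per-lane mask, mirroring `65536u >> u32(max(0, mip_map))` from the shader.
inline std::uint32_t mip_mask(float mip_map) {
    const float clamped = std::min(std::max(0.0f, mip_map), 31.0f);
    return 65536u >> static_cast<std::uint32_t>(clamped);
}

// Models one wave: OR all lane masks together (the WaveActiveBitOr step),
// then issue a single atomic OR (the WaveIsFirstLane branch).
// Returns the number of atomics issued for the whole wave.
inline int resolve_wave(const std::array<float, kWaveSize>& lane_mips,
                        std::atomic<std::uint32_t>& readback_slot) {
    std::uint32_t wave_mask = 0;
    for (float mip : lane_mips) {
        wave_mask |= mip_mask(mip);
    }
    readback_slot.fetch_or(wave_mask, std::memory_order_relaxed);
    return 1;
}
```

Without the reduction this would be 32 atomics per wave instead of 1. The sketch assumes every lane in the wave belongs to the same material; if lanes can differ, the first lane's material slot would also receive bits from the other lanes' materials.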
I think it could also be caused by that buffer being host mapped so I can do readback
I will fix that and see if it improves
This weekend I want to add mesh streaming
forking daxa for debug info
nsight is just stupid 
slang debug info doesnt work
add -g2 to the cmd line
i have it in a commit but didnt submit
I added g2 and still not worky
yea nsight has 0 slang support
makes no sense cause it doesnt need anything
it should just eat it but doesnt
g2 makes it work in aftermath crashes tho lmao
terrible readback performance has been fixed 2.6-2.2ms -> 0.4-0.6ms
I hate culling
Why tho
I just fixed it. It was so stupid
Mesh streaming is in
I perform frustum + hiz culling which is quite strict and I am evicting meshes after 255 frames
In whole scene there is over 1600 meshes
so
with a 240hz monitor
when i do a 360
i see all sorts of plopping and loading
this would be fixed with faster decompression and less strict frustum culling and maybe getting rid of hiz culling
also distance based culling
so some meshes around camera would be kept
For example you dont see texture pop in for texture streaming because there are far fewer of them and I still keep 16x16 px in VRAM
It also might be creating buffer which is slow it can take up to 10 ms
Also you gotta consider that I am rendering this in 0.5ms so meshes get evicted really quickly
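A minimal sketch of frame-count based eviction, assuming one saturating 8-bit "frames since last visible" counter per mesh. `MeshResidency` and `tick_residency` are hypothetical names, and the actual streaming in and freeing of VRAM is only hinted at in comments.

```cpp
#include <cassert>
#include <cstdint>

struct MeshResidency {
    std::uint8_t frames_unseen = 0;  // saturates at 255
    bool resident = false;
};

// Call once per frame with the culling result for this mesh.
// Returns true if the mesh was evicted on this call.
inline bool tick_residency(MeshResidency& mesh, bool visible_this_frame) {
    if (visible_this_frame) {
        mesh.frames_unseen = 0;
        mesh.resident = true;  // stands in for "request streaming in"
        return false;
    }
    if (!mesh.resident) {
        return false;
    }
    if (mesh.frames_unseen < 255) {
        ++mesh.frames_unseen;
    }
    if (mesh.frames_unseen == 255) {
        mesh.resident = false;  // stands in for "free the VRAM"
        return true;
    }
    return false;
}
```

Because the limit is counted in frames, the wall-clock lifetime depends on frame time: at 0.5 ms per frame, 255 frames is only about 0.13 s, while at 240 Hz it is a bit over a second. A time-based limit would behave the same regardless of frame rate.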
perhaps this will be depending on what kind of game you make with it
a car racing game which is rather linear you can just evict asap
with some openworld fps nonsense not so fast
openworld rts with some isometric camera faster again et al
made it "instant" with caching and running it in release build 
so much threads
Funny thing this would be faster(my guess everything would be loaded in 1 frame) if threads werent too slow to pick up the jobs
thanks to task and mesh shader I need to basically change all my code
also tragic embed fail
I just hit a slang compiler bug
this is the code that makes slang explode
```
groupshared u32 hw_meshlet_count;

[shader("amplification")]
[numthreads(32, 1, 1)]
void draw_meshlets_task(u32vec3 local_thread_index : SV_GroupThreadID, u32vec3 global_thread_index : SV_DispatchThreadID, u32vec3 group_index : SV_GroupID) {
    const u32 count = push.uses.u_meshlets_data->count; // reading from pointer in push constant
    if(local_thread_index.x == 0) {
        printf("%u\n", count); // commenting this out makes the validation layer error disappear
        // commenting out the code below also makes the validation layer error disappear.
        // either the printf or the atomic code below has to be commented out in order to not crash
        u32 i = 0;
        InterlockedAdd(hw_meshlet_count, 1, i);
        hw_meshlet_count = i;
    }
    GroupMemoryBarrierWithGroupSync();
}

struct Vertex {
    f32vec4 sv_position : SV_Position;
};

struct Primtive {
    nointerpolation [[vk::location(0)]] u32 vis_index;
    bool cull_primitive : SV_CullPrimitive;
};

[outputtopology("triangle")]
[shader("mesh")]
[numthreads(1, 1, 1)]
void draw_meshlets_mesh(
    in payload MeshPayload payload,
    u32vec3 local_thread_index : SV_GroupThreadID,
    u32vec3 group_index : SV_GroupID,
    u32vec3 global_thread_index : SV_DispatchThreadID,
    OutputIndices<u32vec3, MAX_TRIANGLES_PER_MESHLET> out_indices,
    OutputVertices<Vertex, MAX_VERTICES_PER_MESHLET> out_vertices,
    OutputPrimitives<Primtive, MAX_TRIANGLES_PER_MESHLET> out_primitives) {}

[shader("fragment")]
void draw_meshlets_frag(in Vertex vertex, in Primtive primitive) {}
```
report it
updated slang and its gone 
I am fighting task graph allocator
@supple cliff Is there some stupid way to make the buffer larger?
found it
yes in constructor
how many trillion vertices is this?
this many 44 224 440 125 vertices
0.044224440125 trillion vertices
I have 24 MB buffer for each bistro so multiply that by 125 bistros
how much memory it is
4080 has 12? or 16? giggerinos of vram
16
also other thing that I need to fix is my task graph allocator
remove dependence on it
also the memory budget textures vs geometry is insane 
I dont know why adding buffer_ptr to task makes the engine crash in submit
You can compress meshlets very easily
put 3 more gpus in the system
: D
hiz is broken again 
the heck
git stash my beloved
Okay now I need to optimize CPU side
oh boy I am iterating over 154k entities
easy 70ms of frame time gone
added special flag for root entities
and now transforms
what impl are you using for them?
for the iteration I mean
I don't think 154k is a lot for ecs
I am using flecs. I was iterating over each entity in query and checking if it doesnt have parent
that cost me 60-70ms
I tried the flecs way but It looks like its sameish
So I went for special flag which changed that from 60-70ms -> 50us
yes that's how I store parenting info too
the hierarchy is another set of components where entities have a HierarchyParent component
but it is strange that flecs doesn't provide this out of the box
flecs has its own. imo its kind of weird
it does
this is how you assign parent to child
this is how you iterate over children
query_transforms = world->query_builder<GlobalTransformComponent, LocalTransformComponent, GlobalTransformComponent*>().term_at(3).cascade(flecs::ChildOf).optional().build();
this is for iterating over parents and cascading to children
something must be wrong, 70ms is enormous for hierarchy traversal
yeah
well eventually you'll want a develop build with optimizations on, but do check what flecs docs say about this
and if it requires optimizations to be on to be fast
now its also fast in debug so I am not complaining 
yeah but if all you want is to see which are the root level entities, then yeah you solved it quickly
but what if you want which are the entities that are two levels deep within their hierarchy that have a particular property?
you're back at 70ms
in general it doesn't make sense to benchmark debug because it reflects nothing real
I am not planning to turning this into game engine or game in near future. So jank solutions until then
it certainly doesn't reflect the performance your user will see
the code will be changed so much by the optimizer (the assembly I mean) that it won't resemble debug at all, most likely
yeah. I got quite big uplift for streaming
these are the components that I use for my hierarchy
struct HierarchyParent
{
SceneEntityId ParentId;
};
struct HierarchyChildren
{
std::vector<SceneEntityId> Children;
};
struct IsRootInHierarchy
{
};
I don't have 154k entities for sure tho so I never benchmarked it
it's a bit of work when the hierarchy changes but iterating afterwards should be reasonably quick
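A sketch of the "do the work when the hierarchy changes" idea, using plain standard containers rather than flecs or entt. `SceneHierarchy` and its methods are hypothetical. The root set is updated on every reparent, so finding root entities never needs a scan over all entities.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using SceneEntityId = std::uint32_t;

class SceneHierarchy {
public:
    // New entities start out as roots with no children.
    void add(SceneEntityId id) {
        children_[id];
        roots_.insert(id);
    }

    // Reparent `child` under `parent` and update the root set incrementally.
    void set_parent(SceneEntityId child, SceneEntityId parent) {
        detach(child);
        parent_[child] = parent;
        children_[parent].push_back(child);
        roots_.erase(child);
    }

    // Remove the parent link, turning `child` back into a root.
    void detach(SceneEntityId child) {
        auto it = parent_.find(child);
        if (it != parent_.end()) {
            auto& siblings = children_[it->second];
            siblings.erase(std::remove(siblings.begin(), siblings.end(), child),
                           siblings.end());
            parent_.erase(it);
        }
        roots_.insert(child);
    }

    const std::unordered_set<SceneEntityId>& roots() const { return roots_; }

    const std::vector<SceneEntityId>& children(SceneEntityId id) const {
        return children_.at(id);
    }

private:
    std::unordered_map<SceneEntityId, SceneEntityId> parent_;
    std::unordered_map<SceneEntityId, std::vector<SceneEntityId>> children_;
    std::unordered_set<SceneEntityId> roots_;
};
```

Iterating `roots()` touches only the root entities, which is the same effect as the special root flag, just kept in a side structure instead of a tag component.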
I had something similar with entt. flecs has its own way and I am too lazy to learn it. It took me a while to query all parent and children transforms
Todo list for me:
- pack transforms and readback for meshes
- make mesh shaders faster
- try nanite lods

No clue how I will select the correct lod
by area perhaps
thats the easy part
but you need to make sure you dont also select the lod level above or below (basically rendering 2 lods' meshlets just overlaying each other)
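One common way to avoid drawing two LODs on top of each other, as I understand it from the Nanite slides: draw a cluster only when its own error is small enough and its parent's error is not. The sketch below assumes the errors are already projected to screen space and never decrease towards coarser LODs; `ClusterError`, `should_draw` and `count_selected` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct ClusterError {
    float self_error;    // error of this cluster's own LOD
    float parent_error;  // error of the coarser group that would replace it
};

// Draw only clusters that are good enough while their parent is not.
inline bool should_draw(const ClusterError& c, float threshold) {
    return c.self_error <= threshold && c.parent_error > threshold;
}

// Counts how many clusters along one fine-to-coarse chain pass the test.
inline std::size_t count_selected(const std::vector<ClusterError>& chain,
                                  float threshold) {
    std::size_t selected = 0;
    for (const ClusterError& c : chain) {
        if (should_draw(c, threshold)) {
            ++selected;
        }
    }
    return selected;
}
```

With monotonic errors exactly one entry per chain passes for any threshold, so neither holes nor overlapping LODs show up. If the projected error is not monotonic, the test can pass for zero or two entries, which shows up as missing or doubled geometry.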
a little overdraw hasnt hurt anyone yet :>

ignore me please, i have no clue about any of this advanced tek
people in #1090390868449558618 know probably or have ideas
I will try to figure this thing. It is going to be fun
first I will get the lods on the screen
Also I hope nvidia devs will bless us with nsight update
Yeah, building the LOD is very difficult to get good results. I still have a lot of stuck triangles a lot of the time.

No clue why going out of bounds on gpu doesnt prompt crash for me....
perhaps robustness is on/off by default?
Today I'll move the rest of the code to mesh shaders and hopefully do some optimization.
Then ||nanite lods||
i think you would like my gif collection
So I moved code to mesh shaders. It was pain to do so. I was hitting some edge cases
Now finally meshlets lods
Gotta do a lot of reading
also got this baby
I am thinking of taking pause on this project and try hw rt
daxa rt?
yes yes
cool cool
the shader binding table stuff is still very awkward
but we couldnt find a good abstraction yet
jaisero worked on it a while but i didnt check on it
So i am eager to see what you come up with
yeah I was looking at shader binding table and it was iffy. I have some idea but idk
@supple cliff My idea to get rid of indices is to wrap it into objects that would hide the index and make it readable.
This the idea I got without much going in depth
I might have shot myself into foot with offline asset baking and some assumptions with meshlets
also I unload meshes from VRAM so another bullet hole in my foot
quite chonker
is that your current project?
yeah if you render your scene in 0.2ms and you are sending 500mbps it fills up quite fast
this is from tracy and not RAM usage of my project
it shows the amount of data that tracy received from application that is being profiled
when you really want to make sure you get 60fps even after your game runs for 100 hours
i see
much better
I do cache in wg memory and the scene in question
its 2544037 meshlets in question to be sw rendered 
both in early and late pass
Only thing to help me would be LODs
Does nsight have anything that tells you which variables are living long?
It just tells you number of live registers, but I don't see how to turn that into something actionable
I will be able to check in 10 hours after school 
just registers
it shows what register usage certain lines cause
I am wondering if moving this snippet of code into an if statement (local_index == 0 or wave is first lane) and caching those variables in the wg memory would help
I've got no clue
This is the first time I am able to profile the frame and see shaders
made it bit better
now this is the worst offender
I made other change which cut it by 0.5ms in total. But these changes are quite incremental. The biggest problem is how many meshlets are rendered in first place
I found easy trick to compute normal matrix on gpu
and its dirt cheap
```
const f32mat3x3 normal_matrix = f32mat3x3(
    cross(transform_matrix[1].xyz, transform_matrix[2].xyz),
    cross(transform_matrix[2].xyz, transform_matrix[0].xyz),
    cross(transform_matrix[0].xyz, transform_matrix[1].xyz)
);
```
My gpu transform info went from 128 bytes to 84 bytes after using a 3x4 matrix and a 3x3 matrix. Now it can be just a 3x4 matrix (48 bytes). After separating position, scale and quaternion from the matrix it can be just 40 bytes
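A small C++ check of the sizes and of the cross-product trick, assuming 4-byte floats and treating the three vectors as rows of the upper 3x3 part. `Transform3x4`, `TransformPRS`, `Mat3` and `normal_matrix` are illustrative names, not the engine's types.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<float, 3>;

inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// 3x4 affine transform: 12 floats = 48 bytes.
struct Transform3x4 { float m[3][4]; };
// Position + scale + rotation quaternion: 12 + 12 + 16 = 40 bytes.
struct TransformPRS { float position[3]; float scale[3]; float rotation[4]; };

static_assert(sizeof(Transform3x4) == 48, "3x4 float matrix is 48 bytes");
static_assert(sizeof(TransformPRS) == 40, "position/scale/quaternion is 40 bytes");
static_assert(sizeof(float[3][4]) + sizeof(float[3][3]) == 84, "3x4 plus 3x3 is 84 bytes");

struct Mat3 { Vec3 rows[3]; };

// Cofactor matrix built from cross products, same pattern as the shader code.
inline Mat3 normal_matrix(const Mat3& m) {
    return Mat3{{cross(m.rows[1], m.rows[2]),
                 cross(m.rows[2], m.rows[0]),
                 cross(m.rows[0], m.rows[1])}};
}
```

The result is the inverse transpose scaled by the determinant, so it handles non-uniform scale, but transformed normals still need to be renormalized afterwards.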
lods are borked 
I had some big chunks missing
it seems it really hates roofs of bistro since they are made just out of one meshlet
cooked
what does overdraw look like?
honestly I wonder too 
luckily i know a tool which can visualize that
I do sw raster also so it wouldnt help 
why is the buffer memory usage so high, when for images it's low af?
How do y'all deal with small meshlets/triangles getting stuck and never getting simplified?
Like meshopt does a great job of evenly splitting things up into meshlets for the first LOD
But then the next level it gets kinda screwed up, and it often doesn't simplify out
its a lot of data: you are looking at 125k meshes and I allocate for the worst case
only choice is to cry in corner
one leaf is one meshlet 
that dark red spot is that stupid bushy tree neh?
yeah 
bistor bad
yeah
nope
sw raster is jugging destroying performance so much
only good thing is I finally got lods in
if only that could be hardware acceleratered
so I can focus on performance
its an oxymoron
I know
. It was to my previous message
and writing thesis with studying for final exam
thesis and the final exams are in few months
so its fine
they are in may
so many changes
@small osprey I'm thinking about it, and how does a separate BVH tree per LOD level make sense? Wouldn't all the groups in a LOD level have similar errors?? What's the BVH really accelerating here?
Frustum/Occlusion culling, maybe?? Idk
Wouldn't all the groups in a LOD level have similar errors??
yeah you want that so you can find the cut. if you had widely varying errors next to each other in the tree, you'd traverse using only the min of them, so it would be a deeper traversal
But like what's the point of a tree then?
Why not just group meshlets by LOD, and then have like 1 two level tree where you select a LOD level, and then look at every meshlet in the LOD
you don't always select only one lod
that's discrete lods again
remember, we have similar world space errors, but they're not similar after screen space transformation
Yes, but you can still have a 2-level tree
it depends on absolute error as well as bounds
I don't get how you're accelerating anything by having a multi level tree
because the value you're traversing on depends on abs error and bounds
I'm not getting it :/
right, and?
so you can discard huge parts of the lod-bvh if a node high up has too much error
say you have a lod transition halfway across the mesh from the camera
you will traverse the half of the lod-bvh for lod n, and half of it for lod n + 1
if you had a two level tree you'd traverse two whole lods
now imagine you have 6 lod transitions because it's a massive mesh
with a two level tree you'd go through 6 lods worth of meshlets
with a bvh it's roughly similar
i think atleast
i just did it because that's what the slides said they did lmfao
ok, so
the LOD selection is based only on LOD spheres
So do you build your AABBs for the BVH based on the group LOD spheres?
I know LOD selection depends in part based on the meshlet group position
But like I don't quite see how the BVH interacts with that
well, not really
abs error is not identical across an entire LOD
it varies considerably
Does it? Ok
it's just whatever meshopt spits out as the highest edge collapse error
Remind me how I build the leaves of the BVH though? It's meshlet groups, but what do I use for determining the AABB bounds for its BVH node?
if some part of the mesh has lower tri density, error will be higher for the same lod
individual meshlet AABBs iirc
You're not using meshlet groups for your BVH leaves?
truuuue, ok
uhhhh i forgor lemme check what i actually do
so group lod spheres are the union of all meshlet lod spheres
over actual group AABBs apparently
like, just union of meshlet AABBs
don't ask why

right, that's what I thought
but then how do you guarantee that error is monotonic down the BVH?
i store unioned lod bounds too for projection
just don't use it for SAH
maybe i should
also my error projection still isn't quite right so it's not actually monotonic always 
things like disappearing sometimes
but very rarely
right, and this seems like a problem
shouldn't be a correctness problem, maybe just a perf issue
because when converting the temporary build bvh to the full bvh i recursively merge all lod and cull bounds again
and max out abs errors
idk tho
i wrote this code with zero refs because karis doesn't talk about it in the slides 
it's a miracle it even somewhat works
yeah :/
this is how tidos culling looks like atm
quite similar
LOD 0
LOD 1
LOD 2
I don't know why it gets so much worse over time D:
how I'm grouping clusters must be poorly done
that, or splitting the group ig
how are you able to see individual lods?
also I might be too tired but I dont see anything really wrong
number of meshlets is going down
Messing around with error projection
I'm using the exact same mesh as https://shawntsh1229.github.io/2024/05/08/Simplified-Nanite-Virtualized-Geometry-In-MiniEngine/#Cluster-Partition but they have way better partitions
I need to play more with the nanite lods but first I want to optimize the rendering
I'm going to try adding spatial links between clusters and see if it helps
I'm not sure how to do it tbh
I want to focus on minimizing shared edges
But also have it so if that there's an equal number of shared edges, then to use spatial weights
I see tiny meshlets
which means that clustering groups after simplification is messing up yeah
what's your target tri count?
128t 255v
no like percentage for simplify
you mean making meshlets out of the simplified group?
yeah
Amount of indices in the group / 2
8 meshlets
Nah still a problem
I keep ending up with these tiny meshlets :/
try bumping up to 16 as well perhaps
How long do your assets take to build?
pretty much instant for the bunny, much longer for larger meshes
Same problem
rip
15 minutes for lucy, but less than a minute for something with 4 mil trognles 
something isn't quite linear
I need to build a way to visualize groups, so that I can see if it's the grouping, the simplification, or the splitting step that's the issue
jasmine's blog and the nanite slides personally
Thank you. Nanite is similar to normal meshlet rendering but with nanite lods and software rasterization in compute. I will send you resources when I come home 
The thing that most help me was lurking here and reading everything that I could but also the implementations
https://jglrxavpok.github.io/
https://jms55.github.io/posts/
https://shawntsh1229.github.io/2024/05/08/Simplified-Nanite-Virtualized-Geometry-In-MiniEngine/
https://www.youtube.com/watch?v=EtX7WnFhxtQ
https://www.youtube.com/watch?v=BR2my8OE1Sc&list=PL0JVLUVCkk-l7CWCn3-cdftR0oajugYvd
jglrxavpok is also on the serveur, jmsine too
lvstri and wpotti also have some clue
I need to fix my random crashes...
0.0029 TRILLION vertices
technically 16T vertices
I should get rid of the vertex count. its the unique vertices count
then show the real values
bug if you use timeline semas iirc
it's not a real bottleneck
you should go work for them
would like to but I need to get out of the high school first 
so in 4 years (also after college)
your school is high? 
i miss high school
after I graduate I will have long summer holidays, about 4 months, and then college starts...
life was so fun back then
now i speedrun 2 months of coursework in 3 days after the deadline

meanwhile I will work. I am lucky that after graduation I need to just call a manager at one company to get a job immediately 
I hope the college wont suck that much
I already know some cs stuff from hs which is like 2 semesters
Honestly I dont
you will soon enough 
friends and classmates are good. But boy doing nothing or some worthless stuff
at least I know how to configure ms dos and set up vlans
I wont be missing waking up at 5:00 to get to school
ok yeah i don't miss that
i love waking up at 4 pm and missing all my lectures
it's great
night owl I see
I am in the last grade and there is almost nothing left to teach and final exams are around the corner. So my Thursdays and Fridays look like: get to school at 7:00, sit 2 hours of some linux stuff, sit 2 hours of java stuff, english and another 2 hours of java where the teacher shows how to brew beer
Also today we had some students from a college present why to choose their college. Not joking, the pros of their college are that they have their own brewery and pub in the faculty
yeah 12th grade for me was vibe, chemistry, then vibe
are you germany
surely you're in germany
a lot of colleges have pubs in their faculties
nope
lol
czechia
we've got like 3 student union run bars
we drink 2x more beer than germans
no brewery tho
this isnt a joke
this is just beer
we have own wine and hard alcohol
surely the UK is gonna beat y'all on pure alcohol consumption right
yum
doubles as nail polish remover
nail polish or nail polish remover
In a few days it's Christmas holidays, which I will spend preparing for final exams

Standard is your native language: 2 essays, a grammar exam, and reading 20 books, and they will ask you questions about everything, even the authors. Then English or Math. English is an easy essay, grammar and an oral exam. Math is self explanatory. But I am in a computer science high school so I have additional programming, databases, networking, hardware and some stuff about boolean algebra, processors, etc... and then also a high school thesis (basically a bachelor thesis)
this is what awaits me
in few months
Each cs subject is 30 questions
It is a lot to remember
All these exams happen in 1.5 week window
wtf
Also I will have my college entrance exams before I graduate
yeah thats why I am
about final exams
Ok, so here's what my meshlet groups look like at LOD 1
I'm gonna try METIS for grouping triangles into meshlets
it looks a bit better at least from this angle
Anyone have unreal they can run the mesh through and see how it looks in nanite?
Thank you!
thats going to take some time
Nw. I appreciate the help.
See if Nanite can visualize meshlet groups, or only meshlets.
And try to get close enough to visualize LOD 0 and 1 please
Hmm what is "patches"?
I wonder too
triangles?
unreal now has nanite tessellation so maybe its related to that?
the term patches suggests that yeah
Ohh probably
I was going to say, it could be cluster groups, but I never saw them use that term.
Tessellation is more likely, given that it was added after their 2021 presentation
Anyways I think there are probably improvements I could make to the cluster grouping
But I have a feeling that my main issue rn is meshopt's meshlet algorithm performs poorly for Nanite
Small meshlets are very problematic when you need to make the next LOD
And it doesn't prioritize minimizing locked edges
So, I'm gonna try METIS
It is from their new presentation
https://www.unrealengine.com/en-US/blog/take-a-deep-dive-into-nanite-gpu-driven-materials
They have done some serious improvements to rendering and materials
Yeah I've read that. I haven't spent any time on material optimization or ergonomics yet.
I plan to eventually though
Same here
Really hoping we get device generated commands in wgpu before I have to tackle that...
I want to spend a few months on nanite until I graduate, then I will have fun with shading and other stuff
What would it be useful for? I only see it being handy in culling
or also materials
brute force depth comps per material
Having a pass analyze the visbuffer, and then write out 1 dispatch per material actually needed is much cheaper.
No that's what I do rn, but it's still:
- A lot of depth tests
- A lot of CPU overhead
- Has issues with occupancy and empty draws. See the nanite presentation.
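For reference, the "analyze the visbuffer, then write one dispatch per material actually needed" pass can be sketched on the CPU like this. It assumes a per-pixel material ID is already available; on the GPU it would be a compute pass writing indirect dispatch arguments. `MaterialDispatch` and `build_material_dispatches` are made-up names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MaterialDispatch {
    std::uint32_t material_index;
    std::uint32_t group_count_x;  // ceil(pixel_count / threads_per_group)
};

// Counts visible pixels per material and emits one dispatch for each
// material that covers at least one pixel.
inline std::vector<MaterialDispatch> build_material_dispatches(
    const std::vector<std::uint32_t>& pixel_material_ids,
    std::uint32_t material_count,
    std::uint32_t threads_per_group) {
    std::vector<std::uint32_t> pixel_counts(material_count, 0);
    for (std::uint32_t id : pixel_material_ids) {
        if (id < material_count) {
            ++pixel_counts[id];
        }
    }
    std::vector<MaterialDispatch> dispatches;
    for (std::uint32_t m = 0; m < material_count; ++m) {
        if (pixel_counts[m] == 0) {
            continue;  // material not visible: no dispatch at all
        }
        const std::uint32_t groups =
            (pixel_counts[m] + threads_per_group - 1) / threads_per_group;
        dispatches.push_back({m, groups});
    }
    return dispatches;
}
```

Materials with zero visible pixels produce no entry here. Skipping those empty dispatches without a CPU round trip is the part that needs GPU-side command generation.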
On consoles yes. On PC they didn't have anything available at the time of the presentation except workgraphs, which aren't as great a fit as DGC.

I will clean up my code and fix CPU perf issue
Then try loading bunny
Few weeks ago I tried reading dgc code and it was awful
good thing they allow you create your own structure and tell it what to read
also are you running meshopt_optimizeMeshlet for each meshlet? @summer sigil
lumen is having some schizo moment
What's wrong with it?
oh, the boiling
That's inevitable
All you can do is design your game to work around it, try to hide it with texture detail, tweak some variables, etc.
I am being so gaslighted rn
I am setting positions correctly but flecs is not doing its thing
I kind of made progress on using METIS for meshlet generation instead of meshoptimizer
Not quite working though...
Did you look into SimNanite how it does it?
it doesnt provide any weights to metis at all
I'll have to look
I am hitting some weird bug/ub inside my ecs 
No clue why its happening so ditching this and will work on something else
how
Ok so metis is choosing triangles for meshlets that are completely disjoint, ahhh
so, my graph edges are wrong, maybe
Maybe it doesn't work if I do it on vertices
I'm partitioning triangles based on shared vertices
Although maybe my vertex IDs are wrong, hmmm
mmhm
are you converting the mesh into dual graph?
then that might be it
oh hang on I might know
ah yeah that fixed it
I was essentially saying triangle_id = indices[i] / 3
when it should be triangle_id = i / 3
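The fix in code form, as a small standalone sketch: `i` is a position in the index buffer, so the triangle is `i / 3`, while `indices[i]` is a vertex ID. `triangles_per_vertex` is a hypothetical helper and assumes every index is smaller than `vertex_count`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each vertex, list the triangles that reference it.
inline std::vector<std::vector<std::uint32_t>> triangles_per_vertex(
    const std::vector<std::uint32_t>& indices, std::size_t vertex_count) {
    std::vector<std::vector<std::uint32_t>> result(vertex_count);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        // Correct: derive the triangle from the index-buffer position.
        // The bug was using indices[i] / 3, which divides a vertex ID.
        const std::uint32_t triangle_id = static_cast<std::uint32_t>(i / 3);
        result[indices[i]].push_back(triangle_id);
    }
    return result;
}
```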
The issue is ofc these tiny meshlets still
And I'm only setting partition size to 120, not 128
Because otherwise metis might go over and hit 129 or 130
Annoying that METIS has these tiny meshlets too
But manual recursive bisection might help, idk
Ok so even if there are small meshlets at LOD 0, the good news is that they don't get "stuck" as you build further LODs anymore! This is a visualization of the groups (not meshlets)
no tiny groups!
And if I zoom out on the meshlets view, no tiny meshlets stuck!
I come here to look at an interesting project and I get ptsd from networking this year
I did all of my teachers 30 packet tracers assignments in the last 3 days of school
let me traumatize you even more
I am doing rn some cleaning up. I dropped VRAM usage for mesh instance 50% and meshlet instances 25%.
My todo list:
- optimize culling
- try getting rid of indirection in rendering(meshlet instance index -> meshlet instance data)
- get rid of the stupid populate meshlets shader that I have, whose sole job is to iterate over all meshlets in a mesh and write them out
Finally, fixed LOD 0! #showcase message
what's the secret 
Using METIS instead of meshopt to build meshlets, and a lot of careful tweaking of METIS
I use 128t:255v meshlets
that does not sound fun
And for triangle partitioning, you want to:
- Build a graph where triangles are nodes, edges are between triangles that share an edge/vertex, with the edge weight being the count of shared edges/vertices
- Set ufactor to 1
- Use recursive bisection for the partitioning method (still need to confirm that this is needed over kway)
- Aim for triangle_count.div_ceil(126) partitions
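The graph described in the list above can be sketched as follows, using the shared-vertex variant for the edge weights. It builds the CSR arrays that METIS calls `xadj`, `adjncy` and `adjwgt`, but does not call METIS itself. `TriangleGraph`, `build_triangle_graph` and `partition_count` are illustrative names, and the sketch assumes no degenerate triangles (no vertex repeated inside one triangle).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct TriangleGraph {
    std::vector<std::int32_t> xadj;    // CSR offsets, size = triangle_count + 1
    std::vector<std::int32_t> adjncy;  // neighbouring triangle IDs
    std::vector<std::int32_t> adjwgt;  // shared-vertex count per neighbour
};

inline TriangleGraph build_triangle_graph(const std::vector<std::uint32_t>& indices,
                                          std::size_t vertex_count) {
    const std::size_t triangle_count = indices.size() / 3;

    // Which triangles touch each vertex.
    std::vector<std::vector<std::int32_t>> vertex_tris(vertex_count);
    for (std::size_t i = 0; i < triangle_count * 3; ++i) {
        vertex_tris[indices[i]].push_back(static_cast<std::int32_t>(i / 3));
    }

    // weights[a][b] = number of vertices that triangles a and b share
    // (1 = touching at a corner, 2 = sharing an edge).
    std::vector<std::map<std::int32_t, std::int32_t>> weights(triangle_count);
    for (const auto& tris : vertex_tris) {
        for (std::int32_t a : tris) {
            for (std::int32_t b : tris) {
                if (a != b) {
                    ++weights[static_cast<std::size_t>(a)][b];
                }
            }
        }
    }

    TriangleGraph g;
    g.xadj.push_back(0);
    for (const auto& row : weights) {
        for (const auto& entry : row) {
            g.adjncy.push_back(entry.first);
            g.adjwgt.push_back(entry.second);
        }
        g.xadj.push_back(static_cast<std::int32_t>(g.adjncy.size()));
    }
    return g;
}

// triangle_count.div_ceil(126) from the list above.
inline std::size_t partition_count(std::size_t triangle_count,
                                   std::size_t target = 126) {
    return (triangle_count + target - 1) / target;
}
```

Aiming for 126 rather than 128 leaves a little slack so that a slightly oversized partition still fits a 128-triangle meshlet.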
Still need to test kway again, then test on more meshes, and then work on the meshlet partitioning for groups
hmmm bisection would allow it to be multithreaded too
Rn I'm still using a 8 target group size with ufactor=200 and rejecting <5% simplified groups. All this needs tweaking still.
I'm not actually recursively bisecting it myself, although I've heard this gives better results. Metis has a recursive bisect mode builtin.
But from what I've heard the best way to partition things is to recursively call part_recursive(partitions = 2) until you're close to the target partition size
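The control flow of "bisect until you're close to the target size" can be sketched like this. The two-way split is passed in as a callback so that a real partitioner could be plugged in. `split_in_half` is only a trivial stand-in that cuts by position and is not how METIS partitions; all names here are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Items = std::vector<int>;
// Stand-in for a real two-way partitioner (for example METIS with 2 parts).
using Bisect = std::function<std::pair<Items, Items>(const Items&)>;

inline void partition_recursive(const Items& items, std::size_t max_size,
                                const Bisect& bisect, std::vector<Items>& out) {
    if (items.size() <= max_size) {
        out.push_back(items);
        return;
    }
    std::pair<Items, Items> halves = bisect(items);
    // Guard against a partitioner that fails to split, to avoid endless recursion.
    if (halves.first.empty() || halves.second.empty()) {
        out.push_back(items);
        return;
    }
    partition_recursive(halves.first, max_size, bisect, out);
    partition_recursive(halves.second, max_size, bisect, out);
}

// Trivial bisector used only for illustration: splits by position.
inline std::pair<Items, Items> split_in_half(const Items& items) {
    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>(items.size() / 2);
    return {Items(items.begin(), items.begin() + mid),
            Items(items.begin() + mid, items.end())};
}
```

With even splits the leaves land anywhere between half the target and the target (1000 items with a limit of 128 gives eight partitions of 125). Hitting exactly 128 would need uneven split weights, which this sketch does not attempt. The two recursive calls are independent, so they could run on separate threads.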
I very much suggest spending time on the DAG build
would this be optimal for group partitioning too?
It's the most important thing for perf
Which part?
recursive bisection in general
Yes
wellll i broke my entire runtime rendering when messing with adding a pathtracing ground truth mode
so i have to fix that too 
Algorithm 4 https://pettett.github.io/projects/meshes/main.pdf#page=25, kinda.
Not sure it's entirely perfect, but it's the rough idea
I think unreal uses METIS_PartRecursive instead of METIS_PartKWay like that paper does
A couple people have
well i guess i have a fallback if i don't find something new till then lol
I think you might need to do this if you want perfect 128 size partitions
Otherwise if you go for 126 in a single go, most hit 126 and are slightly empty
Update: Don't think the partition method matters, and it works well on the cliff too
those meshlets look very nice
yeah, the overall pattern is pretty consistent too
Meshopt generates better meshlets if you're not doing hierarchical LODs, but yeah METIS helps for this a lot
Idk about vertex reuse, but culling yes
Meshopt generates a lot more uniform meshlets
I've been thinking about BVH based lod/culling and it just doesn't seem to make sense
These are the amounts of meshlet groups per LOD level
1984
1084
566
284
142
71
36
18
9
5
3
2
1
1
1
if you have one BVH per level, that's such an unbalanced tree
I mean it's still better than parallel brute force, but
remember it's a BVH8
so each level has 8x the capacity
Wdym?
each level has 8 children
so every level you add increases capacity by 8x
so every ~3 LODs adds a level
which isn't as bad
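The "every ~3 LODs adds a level" estimate can be checked with a few lines, assuming perfectly full BVH8 nodes (a real build will be somewhat deeper). `bvh8_levels` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of levels such that 8^levels >= leaf_count.
inline int bvh8_levels(std::uint64_t leaf_count) {
    int levels = 0;
    std::uint64_t capacity = 1;
    while (capacity < leaf_count) {
        capacity *= 8;
        ++levels;
    }
    return levels;
}
```

For the group counts listed earlier (1984, 1084, 566, 284, 142, 71, 36, 18, 9, 5, 3, 2, 1, 1, 1) this gives 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0, so the depth drops by one level every three LODs, since halving three times is a factor of 8.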
No but I mean
You make a separate bvh per lod
But each lod has ~half as many groups
So you end up with basically all of the nodes concentrated to the left side of the tree, no?
Which I mean I suppose is fine, idk
which tree?
The overall tree
I guess yeah
idk maybe nanite does something smarter than me
it probably does
but my thingy is fast enough™
its time yall nanite people cook up a program/slides where all your various tekneeks are compared against each other
ze bunny scene, ze bistro scene, heck why not even the deccer cube scene (the complex ones in a 20x20x20 cube)
because i have total overview who is implementing what based on what and what outcome each of yalls stuff has/does/makes
i dont get it. Whats "the left side" with a bvh8? You can balance a bvh8 as well, it just has one or more partially empty leaves
The way nanite does it is make a separate bvh per lod level, and then combine the roots into a new bvh node
i see what you mean
i wouldnt really worry about that
maybe the largest lods should start earlier in the tree somewhat
i think you can try to balance that also
wont be perfect but you cant balance that anyway as the lods will just have different depths
Right, that's the problem
It's wildly unbalanced
well thats fine, its just the nature of the tree that comes from combining different tree roots
you have to traverse a big part of many trees anyways so its mostly like traversing many trees, just a nicer interface to have one
wait does it?
what I explained was just how I did it
I dunno how nanite actually does it
Afaik yes
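To make the imbalance concrete, here is a toy version of "one BVH8 per LOD, then combine the roots". It is only a sketch: `build_wide` is a hypothetical bottom-up packer with no spatial sorting or bounds, leaves are plain `(lod, index)` tuples, and whether Nanite really builds it this way is not confirmed above:

```python
def build_wide(nodes, width=8):
    # Bottom-up: wrap consecutive runs of `width` nodes in a parent list
    # until a single root is left. A single input node is its own root.
    while len(nodes) > 1:
        nodes = [nodes[i:i + width] for i in range(0, len(nodes), width)]
    return nodes[0]

def leaf_depths(node, depth=0):
    # Leaves are (lod, index) tuples, inner nodes are lists.
    if isinstance(node, tuple):
        yield node[0], depth
    else:
        for child in node:
            yield from leaf_depths(child, depth + 1)

groups_per_lod = [1984, 1084, 566, 284, 142, 71, 36, 18, 9, 5, 3, 2, 1, 1, 1]
lod_roots = [build_wide([(lod, i) for i in range(n)])
             for lod, n in enumerate(groups_per_lod)]
combined = build_wide(lod_roots)  # 15 roots need two extra levels on top

depth_by_lod = {}
for lod, depth in leaf_depths(combined):
    depth_by_lod[lod] = max(depth, depth_by_lod.get(lod, 0))
```

In this toy setup LOD 0 groups end up at depth 6 while the single-group LODs sit at depth 2, which is the lopsided shape being discussed.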
Finally back. To do some work on this project
I made rendering a bit faster by getting rid of the indirection for meshlet instance data from the queue. It decreased frame time by 0.3-0.4ms. It ain't that much for a +-7ms frame.
Only downside is increased bandwidth and memory needed.
Also I am now limited to half of the meshlets that I could have before
In my case that is 33 554 432 meshlet instances
A bit late but happy new year frogs 
happy new frog my year
I hope dwm.exe burns in hell
messing up my profiling
but looks like I got another 0.2ms from frame time
@small osprey the leaf nodes you build your BVH out of are groups of clusters. Are the clusters pre or post simplification? I.e. the clusters you grouped before simplifying, or clusters you make after simplifying?
post simplify i think
actually wait
lemme check the code
yeah i lied

groups are from before simplify
but their parent err is filled in after simplification
right, that's how I had it, but then why does that make sense?
"if error is low enough, display group"
...but the error is the amount of deformation of the simplified group (new meshlets). Not the original group.
i store self-error in the meshlets, and parent-error in the groups
because only parent error is stored in the bvh
the bvh traversal visits all groups with parent error that isn't detailed enough
then meshlet cull goes through and only renders meshlets with self-error that's detailed enough
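A minimal sketch of that two-step test. The loop over groups stands in for the BVH traversal, the errors are plain numbers instead of being projected to screen space per view, and all the names (`Meshlet`, `Group`, `select_meshlets`) are made up for illustration:

```python
from dataclasses import dataclass, field
import math

@dataclass
class Meshlet:
    name: str
    self_error: float      # error of this meshlet vs. the original mesh

@dataclass
class Group:
    parent_error: float    # error of the simplified meshlets built from this group
    meshlets: list = field(default_factory=list)

def select_meshlets(groups, threshold):
    selected = []
    for group in groups:                      # stand-in for BVH traversal
        if group.parent_error <= threshold:
            continue                          # simplified version is good enough, skip group
        for meshlet in group.meshlets:        # stand-in for the meshlet cull pass
            if meshlet.self_error <= threshold:
                selected.append(meshlet.name)
    return selected

groups = [
    Group(1.0, [Meshlet("lod0", 0.0)]),
    Group(4.0, [Meshlet("lod1", 1.0)]),
    Group(math.inf, [Meshlet("lod2", 4.0)]),  # coarsest level has no parent
]
```

For any threshold exactly one level of this chain passes both tests, e.g. `select_meshlets(groups, 2.0)` picks only `lod1`.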
you must be in negative milliseconds by now
I wish
So my 125 bistro + sponza scene is running around +-4.2ms when seeing everything
I shaved off +-1ms from it
major gains was when I removed indirection to fetch meshlet data
threads now directly fetch the mesh, meshlet and transform data they need
Today's win is replacing the populate meshlets shader with a prefix sum and binary search.
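A guess at what the prefix sum + binary search looks like, written in Python rather than shader code. The assumption is that each thread maps its flat index to a (mesh instance, local meshlet) pair by searching an inclusive prefix sum of per-instance meshlet counts; `build_offsets` and `locate` are hypothetical names:

```python
from bisect import bisect_right
from itertools import accumulate

def build_offsets(meshlet_counts):
    # Inclusive prefix sum: offsets[i] = meshlets in instances 0..i.
    return list(accumulate(meshlet_counts))

def locate(thread_index, offsets):
    # First instance whose running total is greater than thread_index.
    instance = bisect_right(offsets, thread_index)
    first = offsets[instance - 1] if instance > 0 else 0
    return instance, thread_index - first

offsets = build_offsets([3, 1, 4])   # [3, 4, 8]
```

Here `locate(3, offsets)` gives `(1, 0)` and `locate(7, offsets)` gives `(2, 3)`, so no separate pass has to write out one entry per meshlet.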
Next step is bvh culling idk if I will do it
now add VSM
I am planning to do it but I want to get nanite™ rendering done first
I can finally focus on writing the high school thesis/blog
ah
About nanite ?
yes
VSM cringe just RT everything
then why nanite in first place 
It is going to take some time to write it 
I have to write it in 2 different languages
you can't RT for primary rays
too slow
I still dk how I'm gonna nicely RT stuff ngl
rn I just throw the high poly into a bvh
I am planning to have a PT reference for my raster cope
But rn sleep is in order and studying for stupid literature exam on 5 books + authors
does what?
puts magnifier icon on desktop icons
its not. Its some ease of access windows thingy
I used linux for a while but physx omniverse support kind of pissed me off after fighting it for 6 hours and I rage switched to using windows 
I only touch bluetooth settings so windows automagicaly did something 
maybe you pressed a magic win+shift+something shortcut that triggered the accessibility-isms
i used to get that switching fucking IME accidentally
left alt + left shift
Anyway tomorrow school starts again 
This half of the school year will be spent going to school doing nothing, gym, studying for college and the final exam
I will have college entrance exams in 20 days. Hopefully you don't really need to study, it's a logic and reading exam 🤷‍♂️
other exams are months away
i wish you best of luck
This is just third of questions for computer science part of final exam
Maua Zaba is missing on the list
Literature is 
sadly they dont ask question about frogs 
: (
Each file is circa 2/3 4A and some more
MINES CANCELED FOR SNOW
So yeah this half of the last grade is not going to be fun
but I will be rewarded with 4 months of holidays
and the college starts 
Dont talk about snow. Czech weather pulled out some funny card. It didn't snow for 20 days, but today it snowed, then we had warm weather so some of the snow turned into ice
So my groceries looked like this
Literally every 5 metres
LOL
You mean just physx or you wanted to use omniverse too ?
The version of PhysX is called like that
Basically PhysX 5
Okay
I got it to compile on linux
But it was hell
Basically you have to compile it manually by selecting the source files you want 🥲
Their cmakelist is doomed
I used vcpkg package and it was 6 hours of creating patches which didn't work
NVIDIA moment
But if you have it still on disk will you share the fix?
And the amount of GCC warnings is insane
I am clang user which was fun also
I can give you my cmakelist if you want
Would appreciate it
Okay i'll give it when i start my computer
🫡
Figuring out how to build the bvh is quite frustrating:/
What is exactly the problem?
You can I think i will open an issue to make it public
Very annoying to wrap my head around which nodes contain pointers to what, and how to setup the groups
very fun 
just copy my code 
Without understanding it? No thanks :p
lol
when yoinking code I find it helpful to just stare at the function I'm copying until I understand what it does 

Okay, after fighting with cmake and linker flags, I have managed to load the translation units, only to learn that thanks to daxa I cant reload my code because my pch includes it everywhere 
good riddance
that is going to be a lot of fixing
@merry laurel can your renderer render this ?
mine without culling and shading on a rtx 4070 is at 0.2fps
Does it have materials?
no
it does not have uvs lol
i had to tweak my asset manager to load it
but this looks like the ultimate test for your renderer
I will try to give it some uvs and materials so my renderer survives
i gave it 0 0 for all uvs x)
I dont even handle that
gltf file has to have uvs otherwise boom
Also I hope it fits into my meshlet limit of 33M meshlets
this thing is so huge that it broke entt's ecs by having too many entities x)
i dont have the count but i had to force the library to use uint64_t for entities
Oh okay then my CPU will suffer
I iterate over all entities twice to check if dirty
I wanted to fix that but flecs wasn't behaving right
flecs ?
entt but better
why is it better ?
do you have scene graph with your ecs ?
you mean hierarchy then yes
how do you manage this ?
elaborate more?
a scene graph is not data oriented so i cant see how it can fit with an ecs
your nodes contains only ecs ids ?
I use flecs to do some magic for me. You could imagine it being a database with queries, same as an sql select. Flecs does the heavy lifting so it stays fast in the data oriented paradigm
this is cpu only
gpu is plain data oriented
but I have some ecs hierarchy there
trying to load it in blender ?
you could add a static page file with fixed min=max to extend that
I need to fix uvs and materials
does blender have culling and optimizations to render those huge meshes ?
lvstri also had a 128gb page file iirc on top of his 128gb of physical ram to export ue city hehe
i just deleted my swapfile to avoid destroying my ssd x)
i dont know if i can still load this model
you get frustum culling with addons
blender is so shitty software
at least for these big scenes
I am tempted to make my own blender
implement your nanite cope in a blender plugin
NO
no better yet, coerce donmccurdy into rewriting the gltf import/export plugins to run natively not via python
Or I will my asset manager
Blender was already choking on caldera
Funny thing about caldera usda file
It has all the LODs but blender usd support sucks and will show you the worst
good luck exporting it
do you have it ? 🥺
yeah caldera was a pain in the ass
nope, gave up
4hrs or so for the airfield piece
because of how stupid blender is
single threaded too
can wait