#Iris - A Journey through OpenGL and beyond to learn Graphics
1 messages ยท Page 15 of 1
: )
bistro be looking very square 
its the bistro square after all
ah and because you dont take transforms into account, the lights are gone here too : >
yep :(
top priority right now is figuring out how UE deals with partitions bigger than the prim upper bound
and also fixing horrible meshlet islands
when is next exam?
23rd with a project deadline the 16th 
I also have to think about how to store this DAG
we also need a name for this tech
current candidates are:
- Frognite
- Nanofrog
I'll also ask Suslik tomorrow if he still remembers how his clusterizer worked
I can't sleep like this
I have a trillion ideas I want to try for this but exams and I gotta sleep
why does life
whenever I can form coherent sentences about graphs and kdtrees 
I think I'll shelve the custom clusterizer for now and try using meshoptimizer's
maybe it's viable maybe it isn't, regardless I have about 20 more steps to get nanite
the next one is figuring out the DAG (and consequently meshlet borders, locking and adaptive simplification)
I'll work on lumen while you do nanite
epic
games
one day we'll merge our powers and combine VSM Lumen and Nanite
and then we'll get cease and desists from Epic
have you determined if it's possible to implement a subset of nanite and still observe a benefit
i.e., reduce the work for smaller gains
because having any auto lod at all is very useful
I mean in terms of implementation effort
oh
I'm going for the simplest possible impl
So far it's a bumpy road
I think the major hurdles will be due to edge cases
don't care yet about those but I'll try to mitigate them when I see one
I also realized that I don't understand mesh shading at all
as in "what is a vertex index buffer"
"for indirection into the vertex buffer allowing vertex reuse"
is a very valid answer and the one I'd have given until yesterday
I gotta get a deeper understanding on that though to make goodโข๏ธ partitioning
this doesn't matter if you just use meshopt for the initial clustering and "split" steps
although meshoptimizet is suboptimal for nanite, I confirmed that
long meshlets + potentially discontinuous island are terrible
this means that two different meshlets might share two borders
is that why you were making your own clusteroni
which breaks some assumptions
I'm trying and failing
I'll ask suslik for help tomorrow
sucking is the first step to being kinda good at something
the two major issues with my own clusterizer right now are that my heuristics suck
and that partition size is not guaranteed
so this happens 
does it really matter?
yes
when you merge all tringles into one list, and then just partition by 128 tringles
mesh shaders have a strict upper bound
and fill up with NANs at the end
that's one solution
Subtract one from the primitive count smh
subdividing might be a better one
ah
that i even understand : )
or you smuggle in a little low poly frog mesh
in 127 variations, to fill up all possible scenarios
when the last meshlet only has 1, 2, 3, 4, 5 etc vertices then the frog127, frog126, frog125etc mesh goes here ๐
would be funny to use little frogs as a debug mesh
I do use the good froge
the big froge works nicely
it's one primitive, it's reasonably complex and it has meshlets of all shapes and kinds
also that made me think about a possible low poly aesthetic for a game that uses as few vertices as possible to represent an object
VK_LINES castle
that one big tower which towers over all the other parts of it
you keep increasing GL_LINE_WIDTH until its big enough to switch to actual mesh, as you come closer
lustri, write your ideas down and then focus on your exams
unless you can convince sir brain to hire you out of school before all that
this be a meshlet
how do I find the edges of said meshlet 
do I have to take into account winding order?
criver might know?
perchance
huh
maybe border edges can only connect two vertices
wait
I mean
border vertices can only be connected by two edges
or maybe 3
at most 3
can an inner vertex be connected by 3 edges though?
yes..
hmm
oh old on
a boundary edge is only referenced by one primitive
so if I encounter the same edge twice when scanning primitives then it's not a boundary edge
now if I have this index buffer [0, 1, 2]
How do I traverse it?
0 -> 1
1 -> 2
2 -> 0?
Rather how does the GPU do it
the rasterizer gets a list of vertices
let's say it's just a triangle
[0, 1, 2] is its index buffer
I guess the order of the indices defines the winding order?
yea thats it
assuming the vertices stay static
but that is defined by the positions as well
the winding order is determined by the signed area of the triangle
which you can calculate from the determinant of the vertices or sumshit
right yeah
but I mean
the signed area changes if I swap around indices in the index buffer
it do
ok I think I can do this maybe
yes you can
where's my hashmap
my man Federico was rendering 373 million triangles meshes in 2008 with a fucking 6800 GT and OpenGL 2
it works
holy shit it works
let's fucking go
don't mind the uh
the uh
auto clusterEdges = std::vector<std::vector<std::pair<std::uint32_t, std::uint32_t>>>(currentClusters.size());
auto clusterEdgeCounts = std::vector<std::unordered_map<std::pair<std::uint32_t, std::uint32_t>, std::uint32_t>>(currentClusters.size());```

it's ok this is at build time it isn't meant to be performant
it's not that I can't code
nope, totally unrelated
Cache: sayonara...
might as well write CacheMiss<CacheMiss<CacheMiss<uint32>>>
List<List<Pair<
List<HashMap<...
and a diet coke
buy one of the X3D cpus for epic cache to minimize the damage of this situation
true
buy AMD
(not sponsored)
ok now we can partition
and then make a shrimplify and split
and then make a dag
then runtime LOD selection
very easy
for (auto index = 0; const auto& cluster : currentClusters) {
auto clusterEdges = std::vector<std::uint64_t>();
for (auto k = 0; k < cluster.triangle_count; ++k) {
const auto basePrimitiveId = cluster.triangle_offset + k * 3;
if (edge0.first > edge0.second) {
std::swap(edge0.first, edge0.second);
}
if (edge1.first > edge1.second) {
std::swap(edge1.first, edge1.second);
}
if (edge2.first > edge2.second) {
std::swap(edge2.first, edge2.second);
}
clusterEdges.emplace_back(static_cast<std::uint64_t>(edge0.first) << 32 | edge0.second);
clusterEdges.emplace_back(static_cast<std::uint64_t>(edge1.first) << 32 | edge1.second);
clusterEdges.emplace_back(static_cast<std::uint64_t>(edge2.first) << 32 | edge2.second);
}
std::sort(std::execution::par, clusterEdges.begin(), clusterEdges.end());
auto clusterBorderEdges = std::vector<std::uint64_t>();
clusterBorderEdges.reserve(clusterEdges.size());
if (clusterEdges[0] != clusterEdges[1]) {
clusterBorderEdges.emplace_back(clusterEdges[0]);
}
for (auto k = 1; k < clusterEdges.size() - 2; ++k) {
const auto& previousEdge = clusterEdges[k - 1];
const auto& edge = clusterEdges[k];
const auto& nextEdge = clusterEdges[k + 1];
if (edge != previousEdge && edge != nextEdge) {
clusterBorderEdges.emplace_back(edge);
}
}
if (clusterEdges[clusterEdges.size() - 2] != clusterEdges[clusterEdges.size() - 1]) {
clusterBorderEdges.emplace_back(clusterEdges[clusterEdges.size() - 1]);
}
++index;
}```I made it less cancer
I guess now that I sort this I can make a sliding window search to look for shared meshlet borders
although borders can get pretty big
eh it's fine, CPUs are fast 
and since I'm already breaking half of the GLTF spec, why not break it even more, I'll define custom attributes to store dual graph topology & stuff
hmm I guess edge weights now are the number of shared boundary edges
wait hold up
uniqueEdges = List<HashMap<uint64, uint64>>();
for (index, cluster) in clusters {
clusterEdges = HashMap<uint64, uint64>();
for (id, _) in cluster.primitives {
const auto edges = [
sort((cluster.triangles[id * 3 + 0], cluster.triangles[id * 3 + 1])),
sort((cluster.triangles[id * 3 + 1], cluster.triangles[id * 3 + 2])),
sort((cluster.triangles[id * 3 + 2], cluster.triangles[id * 3 + 0])),
];
for edge in edges {
++clusterEdges[(edge.0 << 32) | edge.1];
}
}
uniqueEdges[index] = clusterEdges
.erase((edge, count) => count > 1);
}
clusterBorders = HashMap<uint64, List<(uint64, List<uint64>)>>();
for (i, _) in clusters {
sharedEdges = List<uint64>();
for (j, _) in clusters {
if i == j {
continue;
}
for edge in uniqueEdges[i] {
if uniqueEdges[j].contains(edge) {
sharedEdges.push(edge);
}
}
if sharedEdges.isEmpty() {
continue;
}
clusterBorders[i].push((j, sharedEdges));
}
}```
damn
wait no still wrong
shit
ok now it's good (maybe)
ok pseudocode helps
std::unordered_map<std::uint64_t, std::vector<std::pair<std::uint64_t, std::vector<std::uint64_t>>>>()```
it got worse
That's heresy
i like how you exaggerated with the std:: for uint64_t ๐
I could've copy pasted my own integer types
but I feel like it's too late now 
Anyways I bothered Suslik again
I hope he doesn't kill me
what were you aksing? ๐
we soon need a meshlet containing emoji
shit
the partitions are wrong
how tf
ok time to go back to balls
I might need bigger balls though
goddamit
it's the disconnected meshlets
fuuuuck
meshopt_simplify also fails spectacularly to halve the number of triangles
damn
it's going wrong in so many different places at once 
nevermind
my ability to code dwindles with every second
I don't even know why I assumed meshoptimizer was wrong instead of my ridiculously dodgy code
meshopt_simplify works so flawlessly it's scary
alright
no seams/tears
let's fucking gooo
but suzanne had a bit of a lobotomy 
alright moment of truth
let's randomize vertex selection
those traverse guys are full of shit
beautiful unchanged borders
we're nearly done boys
all I have to do is the goddamn DAG and we win
this is just a ball 
oukay : )
yeah i can see we are inside
man this is too good
we're almost there boys
and then I'll replace every one of your renderers with this
all this meshletism is because the gpu has an easier time scheduling work for those smoller islands of verticles?
yeah
it sounds counter intuitive at first
more vertex groups to be aabb or whatever checked
than just the original mesh primitives
i cant wait to unlock this achievement too hehe
I'll delegate Jaker to write documentation for my nanite
so that you can go ahead and implement it
oh yeah, potrick and saky were working on nanite 2 as well
were?
saky got stuck writing VSM then 
and potrick got the culling brainworm
alright this is it for today
tomorrow I'll do stable partitioning and cross cluster boundary edge weights
and maybe I'll figure out how to hack glTF to store le epic clusters
sounds like a plan
real
hmm maybe the usual limits are causing clusters to be too smol
with 128/128 I get sparser but more consistent clusters
the usual game of tradeoffs
ok 128/128 is gud enough
now
what do I do
- store everything needed to build the DAG and build it at runtime
- serialize the DAG and somehow also the vertex/index data referenced by said DAG
I feel like the first one is easier
goddamnit
this shit is so slow
for (auto k = 0; k < meshClusters.size(); ++k) {
auto borderingClusters = std::vector<std::pair<std::uint64_t, std::vector<std::uint64_t>>>();
for (auto z = 0; z < meshClusters.size(); ++z) {
if (k == z) {
continue;
}
auto sharedEdges = std::vector<std::uint64_t>();
for (const auto& edge : uniqueEdges[k]) {
if (uniqueEdges[z].contains(edge)) {
sharedEdges.emplace_back(edge);
}
}
if (sharedEdges.empty()) {
continue;
}
borderingClusters.emplace_back(z, std::move(sharedEdges));
}
clusterBorders[k] = std::move(borderingClusters);
}```
and no shit
O(n^2)
is indeed slow 
ill bring back the idea of 127 variants of a frog, to fillup the remaining vertices in a meshlet
no worries
I have fixed everything
โข๏ธ
ok now I really have to start to think on how to store this shit
looking at meshlets at different LODs interact perfectly is fun
but I want to see them switching at runtime
so
glTF
wonderful format
how do I shove meshlet in it
turns out, I didn't fix everything
Bevy is working on a meshlet extension for gltf for our use case
epic
unfortunately I am going insane
so the border issue is irrelevant
but the "generating good meshlets" issue is still alive and kicking
I thought you stopped using METIS
I'm using both METIS and my own graph algos
some stuff is just painful to implement 
ah
but it's ok, METIS isn't giving much trouble now that I know what the cryptic shit Karypis wrote in his document means
anyways the issue is
neither METIS nor I can generate functional meshlets
as in, I cannot generate meshlets that stay under the 128 primitive limit
I can't enforce that limit
further still, meshoptimizer still greatly prefers vertex reuse to spatial locality
Have you seen the algorithms in this paper? https://github.com/Senbyo/meshletmaker/blob/main/README.md
I have the paper open in my browser but I still have to read it 
maybe I'll read it now, draw inspiration from it
Sponza is also a ridiculous mesh
are the big triangles making you down
no
it's the pillars
it's a single mesh
so the clusterizer just assumes it's contiguous
but it isn't
is dis old sponza
ye
is that your code
it's meshoptimizer's
but the same issue happens with graph partitioning
the absolute funniest thing
is that Unreal doesn't give a shit
and just clusterizes all of sponza at once
hmm so make a big soup and then submit the triangles to the clusteroni
yeah
it's just one primitive
which is great, but you still have discontinuous islands in your graph
the pillars and the arches share no vertices
for some goddamn reason
and if I don't I get bad partitioning later because something that should have been a border, actually isn't
bad partitioning = less than ideal partitions
which in turns restricts the simplifier from simplifying
and in turn the split step cannot actually split anything due to the mesh being too simple
ok I figured out how to make the micro index buffer
thank you zeux for commenting your code 
new approach
I think I cracked it
ok maybe I did crack it
This is unreal's ball
This is mine
I think this is good
wow looks like your clusters are more rectangular too........
instead of a more stretched aspect ratio
Unreal generates twice as many clusters as well for some reason
LogStaticMesh: Input: 4130 Clusters, 520344 Triangles and 1199795 Vertices
LogStaticMesh: Output without splits: 4130 Clusters, 520344 Triangles and 1199795 Vertices
LogStaticMesh: Output with splits: 8233 Clusters, 520344 Triangles and 1202735 Vertices
LogStaticMesh: Material Stats - Unique Materials: 1, Fast Path Clusters: 8233, Slow Path Clusters: 0, 1 Material: 8233, 2 Materials: 0, 3 Materials: 0, At Least 4 Materials: 0```
I gotta look at UE's source to figure out how the hell they're coercing their graph partitioning to stay true to the primitive limit
ok they just bisect the graph when it's greater than expected partition size
figures
ok, random ideas I got while I was studying
- One DAG per glTF primitive
- the DAG only has one root node if the primitive is less than 128 primitives or two leaf nodes and one root node if it is more than 128 but less than 255
- The simplifier reaching the target goal doesn't matter
- The cluster partitions must be at least composed of 4 clusters per partition
- Make a better algo for searching connected clusters because O(n^2) where n=numClusters is absolute shit
(it's probably parallelizable)
do you have to find all these clusters during runtime????
is all offline nonsense right?
ye this is all offline
then dont put too much brain-cell-lets into this
I think
maybe I'm approaching this in a wrong way
instead of hoping partitioning and clustering will work with generic meshes of a generic triangle count
why not make sure everything is a multiple of 128
so basically subdivision to make sure a mesh is always at least 128 triangles
let me check what happens if I feed unreal engine a single triangle
ok unreal doesn't tessellate/displace a cube or a triangle
I assume they simply skip all the steps and make the DAG with a single root node
yup
LogStaticMesh: Display: Building static mesh Mesh...
LogStaticMesh: Adjacency [0.00s], tris: 192, UVs 2
LogStaticMesh: Clustering [0.00s]. Ratio: 1.000000
LogStaticMesh: Leaves [0.00s]
LogStaticMesh: Reduce [0.00s]
LogStaticMesh: Fallback 0/1 [0.00s], num tris: 192
LogStaticMesh: ConstrainClusters:
LogStaticMesh: Input: 3 Clusters, 318 Triangles and 277 Vertices
LogStaticMesh: Output without splits: 3 Clusters, 318 Triangles and 277 Vertices
LogStaticMesh: Output with splits: 3 Clusters, 318 Triangles and 277 Vertices
LogStaticMesh: Material Stats - Unique Materials: 1, Fast Path Clusters: 3, Slow Path Clusters: 0, 1 Material: 3, 2 Materials: 0, 3 Materials: 0, At Least 4 Materials: 0```
if a mesh is >128 but <255 then Unreal does some shenanigans
I assume this is simple subdivision and not catmull-clark
catmull-clark subdivision would obviously be bad because it changes the surface
(as in the set of points)
tbh I kinda hate catmull-clark even when using it as an authoring tool
would be nice to have subdivision algorithm that as you subdivide tends to where you'd trace your shadows basically
yeah that would be nice
the last two lines are me going on a tangent, just to be clear, ignore those
no more scuffed shadows on spheres
scuffed shadows on curved surfaces is already a solved problem
anyways I'll try looking for some subdivision algorithms/libraries that can guarantee me N output triangles
any suggestions?
you could just split random edges (preferrably ones that have big triangle lie on them so that your triangles tend to be about the same size) until you arrive at N
hmm
makes sense
if (primitiveCount % 128 > 32) {
area = CalculatePrimitiveArea(mesh);
edges = CalculateWeightedPrimitiveAdjacency(mesh, area);
subdivided = SubdivideMesh(mesh, edges);
}```
btw obviously you can subdivide an edge not into two edges but e.g. 3 edges
or 5 edges
maybe that's worthwhile
this is not relevant or is it? https://www.youtube.com/watch?v=FFWgQZsfwy8 for some reason i watched it last night
this is catmull-clark which changes the surface so unfortunately not for me
but maybe I can apply this knowledge
@wicked notch do you have a reference for calculating tangents from screen space derivitaves?
Something that can be used with the miktspace normal system bevy already uses
I used this: http://www.thetenthplanet.de/archives/1180 there's some precision issues you'll have to fix if you want a decent result
Otherwise I now prefer just calculating tangents on the host if they're missing
so here's another thing I didn't understand at first
you don't use a graph partitioning algorithm to do clusters and hope they don't go over 128
you use a graph partitioning algorithm to recursively bisect a graph and then until the partitions are the right size
ok graph bisection
are Unreal peeps smoking crack or am I dumb
real_t PartitionWeights[] = {
float( TargetNumPartitions / 2 ) / TargetNumPartitions,
1.0f - float( TargetNumPartitions / 2 ) / TargetNumPartitions
};```
how do you think they made unreal
what is the type of TargetNumPartitions
oh yeah
presumably not float
le integer divison
in programming, we do special math
ok so for a graph with N nodes
I gotta do this
minPartitionSize = 124;
maxPartitionSize = 128;
targetPartitionSize = (minPartitionSize + maxPartitionSize) / 2;
targetPartitionCount = Max<int32>(2, Round(nodesCount / float32(targetPartitionSize)));
partitionWeights = [
float32(targetPartitionCount / 2) / targetPartitionCount,
1.0 - (float32(targetPartitionCount / 2) / targetPartitionCount),
];```
and so this way if I have 385 nodes
I do
targetPartitionCount = Max<int32>(2, Round(385 / float32(126))); which is just 3
float32(3 / 2) / 3 which is 0.333
1.0 - float32(3 / 2) / 3 which is 0.666
ok good
then what I do is
BisectGraph(graph) {
(_, partitions) = Partition(2, graph);
front = 0;
back = graph->GetVertexCount();
swap = [];
while front <= back {
while front <= back && partitions[front] == 0 {
swap[front] = front;
front++;
}
while front <= back && partitions[back] == 1 {
swap[back] = back;
back--;
}
if front < back {
swap[front] = back;
swap[back] = front;
front++;
back--;
}
}
split = front;
partitionSize = [split, graph->GetVertexCount() - split];
// make new graphs if partition size still too big
}
RecursiveBisectGraph(graph) {
output = BisectGraph(graph);
if output[0] && output[1] {
RecursiveBisectGraph(output[0]);
RecursiveBisectGraph(output[1]);
}
}```
isnt float(3/2) still 1.0?
yep
and 1.0f / 3 is 0.33
the point is to have partition weights be either 0.333, 0.666 (or more) or 0.5, 0.5 (or less)
ok, looks like you just didnt update your calculation there : )
I hate the fact that I've gotten used to reading unreal's source 
Donโt say it to Tim to not be thrown into his CBT dungeon
Inb4 lvstri gets poached by epic
Inb4 lvstri is actually Tim in disguise 
inb4 lvstri works at epic already, and uses "exams" as excuses for "i need to finish implementing nanonite 3.0 with my father brian or he will shoe me again"
I bisected a graph
and I took an entire friggin hour to understand whatever the hell I was doing
the [front, back] is a vertex remap
but ye did it lad 
this is so unintuitive it's crazy
unreal does a radix sort of some kind
and then hashes the triangle IDs with the distance from their center as the hash
wtf
ahh yep
as usual, more problems came up!
maybe if I allow looser partitioning I can solve this easily
Ah I see
it's fucking foliage
amazing
the offending things are always foliage because they have the most disjoint meshes
(and some windows, because they are the same mesh for some goddamn reason, so their triangles are disjoint as well)
I could maybe tessellate
are you doing that per primitive?
ye
oh yeah another thing I could do is per tringle range materials
like unreal does
crazy that this is so hard
I can't even see what unreal does with bistro because it can't import bistro 
it literally crashes while building nanite meshes
anyways, brain fog is really starting to kick in so I'll leave this for past 16th me
hmm debugging unreal is not an option neh?
I'll try tomorrow
but unreal is just pain
it takes 10 minutes to import bistro when compiling unreal in release mode
one order of magnitude more in debug mode
EPIC, HIRE THIS MAN 
brian already solved foliage 
time to make an unreal without all the clutter and baggage from the past : >
to be able to import bistro hassle free
to be able to implement VSM 2.0 and nanokatzeite 3.0
ok I solved it
with a garbage solution
(even more bisections)
holy shit it's garbage
ok subdivision is literally a requirement
TODO: make subdivision
let delauny and worley come to you in your dreams
OpenSubdiv is covered by the Apache license, and is free to use for commercial or non-commercial use. This is the same code that Pixar uses internally for animated film production. Our intent is to encourage high performance accurate subdiv drawing by giving away the "good stuff".
I didn't know Pixar was this chad
@wicked notch can it ever happen that a meshlet rendered in the first pass (was visible last frame and not frustum culled this frame) won't be visible after performing occlusion culling int he second pass?
no, the second pass only cares about what wasn't visible before
hmm ok, so I need to find why things are breaking...
oh wait lol
I'm reusing the preivous oclcusion buffer for the current, but never clearing it ๐
OpenSubdiv is pretty nice iirc, I believe it's what blender uses for at least one of its subdivision modifiers
I recall cloning it a while back for that reason
Fixed bugs, got a lot more performance back lol
I was accidentally reusing a buffer without clearing it, so once a meshlet became visible it would never become unvisible, leading to the first pass basically rendering every meshlet every time
And then for shadow views, I didn't realize the frustum was not setup to be uploaded to the GPU, so culling wasn't working at all
So I fixed those bugs and now it's much faster lmao
alright, opensubdiv is nice but it's slow as hell
before I jump into opensubdiv's gpu backends I'd want to try MT
MT?
multithreading
btw I now understand what Jensen meant by "the more you buy the more you save"
the more you draw the more perf you have
that's right, the more I tessellate the better the clusterizer does
@wicked notch how do you cope with buffers in Vulkan
particularly updating them
I sense three usages:
- Update whole thing every frame
- Update occasionally
- Never update from CPU (except maybe clearing it with the clear command)
Anyways 1 and 3 are easy to solve in a vacuum
my public API for buffers is kinda barebones at the moment, I just have this
auto Write(const T& value, uint64 offset = 0) noexcept -> void;
auto Write(std::span<const T> values, uint64 offset = 0) noexcept -> void;
auto Write(const void* data, uint64 size, uint64 offset) noexcept -> void;```
is this buffered
no unbuffered
alright so you manually buffer it
yeah
๐ฅ
I think it may be worth to have something like BufferedTypedBuffer<T> or something
even though the name is garbage
The problem is coming up with a unified abstraction for these uses
But it's probably not that hard, just inherit fam
yep
don't inherit too much otherwise you end up like java
new InputStreamReader(new BufferedInputReader(new StreamAdapter(new FuckThisShit())))
I could template too but then everything is now in the header 
Btw I think devsh handles case 2 by using an upload "pool"
That is the rarest case tbh and I think I only do it for scene geometry updates
I suppose I could hack it by treating those as N-buffered, but rip memory
tbh unless you're moving lots of MiB of data around per frame, it's probably fine to just write it all
or at least, update sparsely without buffering
for everything else I think the staging pool that devsh presents is quite good
hmm I better get crackin on this
I didnโt even parse that as C++ at first lmao
RIP memory indeed, but if you aalways use same amount, thats fine
my thing is only more optimal when you do 64mb one frame, and 1mb another
yeah there's nothing you can do when you write the whole thing every frame
this is the biggest mistake you'll ever make ๐
devsh style upload pools are nice but I still n-buffer stuff I write out every frame
I mainly use them for staging mesh/texture uploads
devsh palpatine incident
unlimited buffering
well he's rigging the TimelineSemaphore Deferred std::function<> to a Pool Allocator
for dem bindless descriptor sets
that's what I do for my fixed size upload pool essentially but I don't see where bindless comes in
so I can allocate/deallocate
basically I go alloc_addr() and it gives me a slot into an array descriptor binding thats currently free
when I'm done using a texture and want to mark a descriptor array item as ready for overwrite, I want to do free_addr(slot) right?
but I probably want to latch that on the semaphore value that the last frame that uses said texture will signal
what I do for that is I have FIF implicit linked freelists and append them to the tail of the global free list
so basically I have a std::vector<BindlessHandle> that's all live and dead handles, then BindlessHandle m_waitingFrees[FRAMES_IN_FLIGHT]; and BindlessHandle m_freeListHead/Tail;
yeah but you can easily extend this to work with more TS if you get rid of the FIF
eeeeh
trust me I thought about it
it gets suuuuper messy with multi-threading
also I don't just do free lists, I do arbitrary functors
just get better locks
it just so happens that 99% of the time that functor is a free/deallocate functor
the functors do throw a wrench in it yeah but with update-after-bind and whatnot the only thing that's in my critical section is disturbing the index list for alloc/free
- I use those fancy 1 byte webkit style locks
it gets super messy, because you don't want to run the functor under a lock
and yielding switches fibers instead of blocking the thread
so just locking is no big deal for me
and you also don't want to run the functor from some unexpected thread/actor
this is why my latch lists are partitioned per-semaphore-per-resource
i.e. for the same semaphore Down Streaming Buffer doesn't keep its latched frees interleaved with Up Streaming Buffer and MegaDescriptorSet latched frees
We don't have a master-cleaner/submitter/waiter thread, so sticking the deferred events in the semaphore makes no sense
also you need to recognise those systems for what they are == garbage collection
^ this is much better
I dont want to be running data-download consumption host callbacks when I just want to poll if I can free a single descriptor when I allocate a slot, all because I stick all my events on the same semaphore's queue
if I slapped all events in the same queue, I'd be susceptible to random pauses like that
I just give stuff like that a dedicated update() method, though I guess you'd still be susceptible to the pauses simply from the fact you contend the same lock as the GC cycle
its also kinda important when the deferred event is a free and the resource is not thread-safe
at least my event queues live in the resource, so if I externally synchronise access to the resource, I wont get any nasty surprises like unsynchronized callback execution
thats a free latched with a data consumption callback when you're moving memory GPU -> Host
the fun part is when you use this little function: https://github.com/Devsh-Graphics-Programming/Nabla/blob/84d9aca59ccf8790b118bc9ddb63772078e20efc/include/nbl/video/utilities/IUtilities.h#L499
so you can actually download a 2GB vkBuffer through a 64MB HOST_VISIBLE
in lots of little submits of 1 copy command
and every single time you overflow (can't allocate) the free callbacks will get run until a timeout
this is a bigger brainworm than I'd recommend for safe consumption
the epic staging pool is good but going farther than that with this stuff is 
I see that it functions like an upload buffer but the opposite way around essnetially
but I also think having thread safety issues with the output of your intermediate data download functors is a skill issue
you don't know how much you want it until you do
upload buffer is far easier to write
glBufferSubData
mine aren't stge 5 brainworm yet and don't do the crazy stuff devsh's do like converting your image format live
cause when you want to push 2GB of data to Device from Host througha 64MB buffer, you just need to wait for Device to finish copying the previous chunk of staging buf to destination before you overwrite it (the staging)
download buffer is faaar weirder
cause Host needs to do shit with the data in the 64MB mapped staging BEFORE you free it (and device overwrites with a new chunk)
so you need a callback that runs AFTER semaphore signal and BEFORE the free, which you need to perform to make progress in the copy-submit loop
right ok I'm starting to see
in the upstreaming direction you already have the 2GB Host source and 2GB Device destination laying about
and if don't you'll naturally make multiple calls exactly the chunk size that the Host produces at a time
so with multithreading it, on top of everything else, you have to take care that the execution time of the free functors doesn't bleed into the execution time of enqueueing
I can explicitly force all ready functors to run with a cull_frees()
or just block XD
while (m_downStreamingBuffer->cull_frees()) {}
๐
generally speaking what the buffer will do is try to run 1 or 2 functors every time you allocate
via a poll()
cause otherwise your allocator runs out of memory and you get faux fragmentation
ofc if it can't service your request, then it will do a wait instead of a poll
the allocate call takes a timeout
the way I'd do it is probably just to have a dedicated update() method that calls a parallel scatter of the functors with a gather function pending on it to free the indices when it's done, if I was assuming that running them synchronously would affect execution elsewhere
and if there's no space to allocate, just enqueue your functor into the not ready list directly
and then you need to remember to call the update() method
after upgrading to timeline semas
I literally never have to run that
unless I want the functors to run "earlier than"
functors will [implicitly] either run during the next allocate() or in the destructor of the staging buffer
so it's valid usage to allocate staging buffers that get deleted when you're done staging?
yep
it just that the destructor might stall you
you'll deadlock if you havent submitted the copies yet though
funky, I'd much rather have to remember to call the update method lol
doesn't sound nearly worth it for surprise sync points
its totally worth it
cause we can write code that does not give shits about overflows
like an faux-immediate mode CAD renderer
that can attempt to draw 400MB of CPU produced geometry through a 64MB ReBAR buffer
mine can handle the overflows, I just need to have an uploadPool.update() in my update loop somewhere
yeah and we don't
you just check the condition for "would be overflow"
and do a "hidden transparent submit"
which signals the semaphore value that you latched all the resource frees from waaaay before you knew there'd be an overflow
then you wait on the semaphore
reset the commandbuffer and begin it again
go back to business as usual with the rest of the code
and while its trying to allocate all those resources, it unknowingly releases resources from deferred functors
I do all that except replace waiting on the semaphore for enqueueing the functor
you need to wait on the semaphore cause you've ran out of space
I opted to structure specifically around never waiting on anything
literally no forward progress can be made
yeah that's why I have the update function though, it does the same TS check you'd do when enqueueing another write job, except if nothing is ready I don't block any threads
which is much better for me because fibers = avoid OS blocks
potatoe/tomatoe
you still need to suspend exxecution of the "Immediate Draw" routine
either by waiting, switching or yielding
yeah that's true, luckily I don't have to port GDI/oldGL code lol
the thing is, with your system I'd have to spam update() literally everywhere
so I can make progress
or whenever an allocation fails
my update() is just rolled into the allocation routine implicitly
yeah I guess the major downside is you have to architect everything around being async-friendly
wdym everything?
everything downstream of your uploads and downloads I mean
nope
the whole code outside of the utilities thinks the commandbuffer never got reset
it goes into the function begun
and comes out begun as well
the only way it can know from the outside that there was a hidden submit is because the nextSemaphoreWait.value has incremented
actually you gave me a funny idea ๐
right now
to overlap Host and Device better
maybe I should have a lambda or some condition which causes artificial allocation failure / overflow
i.e. if I have 128MB staging buffer, so the host doesn't suballocate the whole 128MB, write it, submit and block
but rather tries to request 32 or 64mb at a time max
and if there's only 96 or 64mb free, treat it as an overflow
then I could have it block on a much older submit
eeeeh
but then I'd need multiple commandbuffers to round-robin
cause I can't reset a pending one
hmm
ok I might actually implement this as an improvement
aside from the intendedNextSubmit where the last cmdbuf is the resettable scratch, I could have N of them going round robin
that will be stage 69 brainworm
then if you have the right sized N, the actual function with overflows might actually never block (if Host is slow enough on the callbacks or data filling)
honestly that's a pretty neat idea, would need some tuning to be worthwhile though
back to nanite
I have sent the project in
alright so tessellation is done (it saves my ass so much you have no idea, at least for the initial clustering)
strict graph partitioning is done
simplification is done
what's left is
- figure out how I'm gonna store this shit in glTF
- building the DAG
LVSTRI_nanite_at_home
we're almost done boys
just a little more and we'll have flawless LODs
I accept suggestions for storing meshletisms in glTF
like does one mesh = one LOD group?
how do I express the parent - child relationship of the DAG
via edges
yeah but in gltf
hehe idk
maybe it's just better if I make a custom format
CSR?
compressed storage row
wat dat
like a graph that connects vertex 0 to 2,3 - vertex 1 to 0 and 2 and vertex 2 to 0, 1 and 3 is just written like this
[0, 2, 4, 7]
[2, 3, 1, 0, 0, 1, 3]
o
I shall explain better
vertex 0 => look up range in the first array: start = range[0]; end = range[1], edges connecting vertex 0 are in the second array starting at 0 and ending at 2 (excluded)
vertex 1 => look up range in the first array: start = range[1]; end = range[2], edges connecting vertex 1 are in the second array starting at 2 and ending at 4 (excluded)
...
interesting
I see they use this for sparse matrices too
btw how do you select cuts at runtime 
Custom gltf extension, store it as binary serialized data.
that's easy
process all leaf clusters, if their parent's error is too big but the children are good, add them to the list, otherwise add the parent to another list for further processing
repeat until all clusters have been processed
Also @wicked notch when you're done please make a write up of the preprocessing algorithm and save me a lot of future pain ๐
And tbh how you structured the dispatches over meshlets/instances would be good to compare how I did it against, but less important.
remember that I use mesh shaders
Oh you do, nvm. Well actually no, I'd still like to know
I use your same approach btw if I recall
Doing a dispatch with one workgroup per meshlet takes like 0.1ms per 2^16 meshlets, which doesn't scale great :((
To write out the index buffer
uint32 IndexOffset;
uint32 IndexCount;
uint32 PrimitiveOffset;
uint32 PrimitiveCount;
};
struct MeshletInstance {
uint32 MeshletIndex;
uint32 TransformIndex;
uint32 MaterialIndex;
// other per instance data
};```
if the writing + culling is still combined in one dispatch, then that's your bottleneck
No they're seperate
otherwise idk, perchance check what nsight has to say
Which improved things a lot, but it's still a bit expensive
It's simply the overhead of so many dispatches :/. Wgpu kinda limits me here. I can only do 2^16 meshlets/workgroups per dispatch, and it has to put a barrier between every dispatch.
Do you do this in a compute shader or something
yes
rip
ask the wgpu lads to up the limit 
2^16 as workgroup limit is kinda garbage
Wouldn't it work with fake mesh shaders
It's because webgpu enforces one limit for all dispatch dimensions instead of letting X be much higher :p. But yeah, I'm going to try and fix that in wgpu...
But back on topic, please do a writeup of the preprocessing ๐
When you finish*
Lowest common denominator API 
I will, it's very shrimple though
(it isn't)
it's just the edge cases
they're literally limitless
infinite edge cases
Yeah I don't believe you LMAO.
Oh wait iirc it's because of AMD. AMD doesn't give you higher limits...
2^16 * 128 threads is still like 8 million though
It's 1 workgroup per meshlet
Hmm, perhaps? Let me check what the max overall dimensions are.
That's a good idea if it's possible though
btw @frank sail can you link again that thing about persistent threads in kompute
my froge brain does not allow me to remember events past 2 weeks old
โIt was revealed to me in a dream.โ
should've just googled smh
who's gonna tell LVSTRI
oh yeah this is the part that requires UB to work
fun
wait hold up
By UB do you mean "Unreal 5 Behavior" ?
true
ight so
MPMC queues on the gpu
what's the exit condition
and can I just atomicAdd(taskCount, -1)
nope
alright
So uhh trying to spawn 40x40 Stanford bunnies leads to
Buffer binding 6 range 2905695912 exceeds max_*_buffer_binding_size limit 2147483648
That's the index buffer
I simply can't bind that big of a buffer ๐ญ
Idk what to do about that
Device pointers? No lol
Already done... I opened an issue a few weeks ago
thank god for the wayback machine
void main() {
while (true) {
const uint currentTaskIndex = atomicAdd(completedTaskCount, 1);
if (currentTaskIndex > maxTasks) {
return;
}
const Cluster currentCluster = clusters[taskIndirection[currentTaskIndex]];
const Cluster parentCluster = clusters[currentCluster.parent];
if (ProjectError(currentCluster) <= threshold && ProjectError(parentCluster) > threshold) {
ProcessCluster(parentCluster);
} else {
ScheduleWork(parentCluster);
atomicAdd(completedTaskCount, -1);
}
}
}```
I guess idk
this doesn't say when all clusters have been processed though
ye this requires more thinking
According to some testing, my renderer performs worse than regular CPU-driven draw calls...
and has serious memory usage issues having to have a giant index buffer, it's basically unusable for naythign practicle :/
Feels like all my work was wasted unless I go try and add atomic image support and u64s to wgpu/naga
wgpu/naga moment,,,
I need one giant index buffer that stores a u32 per vertex of all possible triangles in the scene
and just binding that large of a buffer is a problem, I reach the max bindable limit...
Buffer binding 6 range 2905695912 exceeds max_*_buffer_binding_size limit 2147483648
isn't that the same if you use regular models 
And that's like nearly 3gb of just an index buffer
Not with instancing, no
ah
Because I have to allocate it per triangle, regardless of instancing
ok I see
Same with all the extra per-meshlet data, to a much lesser extent
surely there is a way you can reuse indices
I bet more indirection would do the trick
I don't see any way. I do a single draw_indirect() and encode the meshlet ID + triangle ID into each index, and then the vertex shader takes the vertex index (the index I wrote) and extracts the meshlet ID + triangle ID for it to get the vertex for.
or not quite meshlet ID + triangle ID, meshlet ID + meshlet index or something, idr exactly
does webgpu have MDI?
It does. Performed extremely poorly when I tested doing one draw per meshlet though.
From what I can tell, it's either mesh shaders, or software raster
yeah one subdraw per meshlet is not ideal
but maybe you can group them or use larger meshlets
or geometry shaders
(jk)
I think I'll at least try larger meshlets
I'm doing 64x64 meshlets, maybe 64x128 performs better
yeah
maybe doing way bigger meshlets would make mdi+instancing viable. I've seen one renderer use 1024 triangle meshlets
Though that was ray tracing
show nsight trace of both
btw I have an idea for software meshletisms
Will later TN
@primal shadow you up?
Yes, I'm at work though. I won't be able to test my stuff for another ~3-4h
ah rip
do you remember how much time it takes for you to rasterize bistro
both generating the index buffer and rasterizing the visbuffer
it doesn't have to be accurate
I never tested on bistro because I don't have a good system for converting whole scenes yet
ight, we'll test later with one of your scenes
but you can technically cut your memory consumption by a factor of 3 if you accept unindexed rendering
without degenerate tringles
Err, how?
I don't do indexed rendering as is really, the index buffer just encodes the meshlet and triangle data for the vertex shader to load
by doing this
if (gl_LocalInvocationID.x == 0) {
const uint currentIndexCount = atomicAdd(g_meshletDrawIndirectBuffer.VertexCount, meshlet.PrimitiveCount * 3);
s_meshletBasePrimitive = currentIndexCount / 3;
}
barrier();
const uint currentPrimitiveId = gl_LocalInvocationID.x;
if (currentPrimitiveId < meshlet.PrimitiveCount) {
g_meshletVisiblePrimitiveBuffer[s_meshletBasePrimitive + currentPrimitiveId] = meshletInstanceIndex << 7 | currentPrimitiveId;
}```
And this
const uint visiblePrimitiveIndex = gl_VertexIndex / 3;
const uint visiblePrimitive = g_meshletVisiblePrimitiveBuffer[visiblePrimitiveIndex];
const uint meshletInstanceIndex = visiblePrimitive >> 7u;
const uint primitiveId = visiblePrimitive & 0x7fu;
const uint primitiveCycle = gl_VertexIndex % 3;
const uint primitiveIndex = uint(g_meshletPrimitiveBuffer[meshlet.PrimitiveOffset + primitiveId * 3 + primitiveCycle]);
const uint vertexIndex = g_meshletIndexBuffer[meshlet.IndexOffset + primitiveIndex];
const vec3 position = g_meshletPositionBuffer[meshlet.VertexOffset + vertexIndex];```
That's what I currently have basically, no? You still have an index buffer of size triangle_count * 3
look closely
I never mul by 3 when indexing the "index buffer" (aka g_meshletVisiblePrimitiveBuffer)
it's sized meshletCount * triangleCount
or well, meshletInstanceCount * triangleCount
Oh so you're doing an indirect draw of vertex count size still
the size of the draw doesn't change yeah
But putting the triangle IDs in a separate buffer and having the vertices load them
Rather than the index buffer
it's exactly 3 times less memory
1gb of data per 250 million triangle instances (excluding asset data)
Still a lot, but more manageable than 3gb
The next step is just budgeting (and consequently, LoD)
Theoretically I need to allocate the whole amount regardless of lods though
No, you allocate a fixed budget
Unless I'm ok estimating and allocating a lower amount of data on the premise that it won't all be used due to LODs/culling
it's just the classic budgeting problem
allocate a fixed budget, work with that, hope that culling and LoD don't make you go over budget
Mhm
I wonder if I can do without uploading per triangle data at all
Just upload visible meshlet IDs and do some kind of prefix sum to inform how many triangles there are per meshlet, as a running count
And then the vertex shader would do some kind of binary search to find their meshlet id
ye that's also a possibility
interestingly, I still fall apart on disjoint meshes
I should probably detect those cases and fallback to kdtree partitioning, instead of adding more logic to this
this is a pain
it's 3am ffs
I eep
I need to test this on a real scene tbh...
I think my occlusion culling is not working ๐ค
Apparently my meshlet bounding spheres are not correct
Fixed occlusion culling
I also realized a fatal flaw in my shader resource table just now 
@wicked notch do you have any idea how to do culling for orthographic projections?
the same way you do for perspective projections
hrm
there is no change in the math if you use the hartmann-gribbs method to extract frustum planes
HZB remains precisely the same
I think adding fake connectivity edges with a huge weight should make this epic
TODO: look more into voronoi++
also figure out disjoint set union and vertex hash discretization
@wicked notch I kind of want to try LODs. What do I need to know?
@wicked notch https://jglrxavpok.github.io/2024/01/19/recreating-nanite-lod-generation.html they have some stuff about welding "close-enough" vertices together there
what an article to wake up to
that article seems like exactly what you are working on lol
now you can copy his homework. how serendipitous
he's using meshoptimizer for clusterization, which is what I'm struggling with
but the "welding vertices together" is extremely valuable info
so I'll be copying that indeed 
wasnt too far fetched then
voronoi is definitely helping me
I really gotta thank all the magic math guys
it's basically the dual of a delaunay's triangulation
I'm also about to commit crimes against unreal engine
none of these planes are connected together, all vertices are unique
thats how Cities Skylines 2 renders forests, no? ๐
real footage of me abusing unreal
hehe my engine does not support untringleized stuff
btw @frank sail
remember the eternally broken HZB?
for some reason it's not broken anymore
I have no idea why this even works
but if during the HzbCopy shader, instead of doing an initial reduction, you just take the sample and shove it into level 0 and then reduce from there
you get no flickering or artifacting of any kind
the gnomes invaded your PC and fixed it
can you try it out in frogfood as well, just to check whether I'm going crazy or not
because it actually works
I don't know why it actually works
can you show the code
after you wake up
erm what I mean is the code for this step
sounds like 1-2 lines
oh yeah
originally we took the nearest 4 texels or so
but it sounds like you want to do something different
it's just this ```glsl
void main() {
const ivec2 sourceSize = textureSize(g_inputDepthImage, 0);
const ivec2 targetSize = imageSize(g_hzbMainImage);
if (any(greaterThanEqual(gl_GlobalInvocationID.xy, targetSize))) {
return;
}
const vec2 ratio = vec2(sourceSize) / vec2(targetSize);
const ivec2 sourcePosition = ivec2(vec2(gl_GlobalInvocationID.xy) * ratio + 0.5);
const ivec2 destPosition = ivec2(gl_GlobalInvocationID.xy);
const float depth = GetSample(g_inputDepthImage, sourcePosition);
imageStore(g_hzbMainImage, destPosition, vec4(depth));
}```
GetSample has an additional safeguard where I do min(position, textureSize(depth, 0) - 1)
wait what does GetSample do exactly
and yet it does
is it just texelFetch
float GetSample(restrict readonly image2D image, ivec2 coord) {
return imageLoad(image, min(coord, imageSize(image) - 1)).x;
}```
I do not know
my guess is that your code subtly fails still
I can't test rn because I'm deep in this uncommitted code (and need to eep soon)
I tested on the classic intel sponza long tile and it works fine
which scene was used the last time we observed HZB failure?
after observing extensively for 5 minutes, I cannot see artifacts
it slightly underculls though
but it does cull something
it does cull a lot yes
when you say it underculls, does that mean it's bugged
driber update perhaps in between which fixed something?
I've had geforce experience that is nagging me to update my drivers for a while now 
or jaker made someone add IrisVk.exe to "the list"
ShooterGame.exe
We experienced some solar radiation storms up north which may have flipped the bit you needed to fix Hi-Z culling.
to avoid clogging the #engine-dev channel
we have unity
it is now time to reverse engineer the shit out of this engine
does it have hzb and vsm? ๐
hehe
i also saw this popping up in my recommendations https://www.youtube.com/watch?v=w99UcsgkUgE
This video is the first in a series of two lectures given by Keenan Crane at the Harvard FRG Workshop on Geometric Methods for Analyzing Discrete Shapes: https://cmsa.fas.harvard.edu/frg-2021/
Day II: https://www.youtube.com/watch?v=JQ2burHX710
Abstract: The intrinsic viewpoint was a hallmark of 19th century geometry, enabling one to reason ab...
seems to be a multipart-er thing
oh wow, unity has shit default shadows
peter panning mmm yes
oh damn unity uses dx11 by default 
it feels like dx11 is still the default for most things these days
just take a look at this capture
you're right tho 
besides the plane being 200 triangles of allah
they don't do any indirect shenanigans, meshletisms, HZBs
yeah I see z-prepass, CSM, rendering, postfx
and one other pass which might be bloom? I can't tell
it's SSS
ah
