#Iris - A Journey through OpenGL and beyond to learn Graphics

wicked notch
#

I guess I should also do something about this 💀

wispy spear
#

: )

wicked notch
#

bistro be looking very square KEKW

wispy spear
#

its the bistro square after all

#

ah and because you dont take transforms into account, the lights are gone here too : >

wicked notch
#

yep :(

#

top priority right now is figuring out how UE deals with partitions bigger than the prim upper bound

#

and also fixing horrible meshlet islands

wispy spear
#

when is next exam?

wicked notch
#

23rd with a project deadline the 16th KEKW

wicked notch
#

I also have to think about how to store this DAG

#

we also need a name for this tech

#

current candidates are:

  • Frognite
  • Nanofrog
#

I'll also ask Suslik tomorrow if he still remembers how his clusterizer worked

wispy spear
#

Frogfrog

#

Nanonite

wicked notch
#

I can't sleep like this

#

I have a trillion ideas I want to try for this but exams and I gotta sleep

#

why does life

frank sail
#

when is the blog dropping btw

wicked notch
#

whenever I can form coherent sentences about graphs and kdtrees KEKW

#

I think I'll shelve the custom clusterizer for now and try using meshoptimizer's

#

maybe it's viable maybe it isn't, regardless I have about 20 more steps to get nanite

#

the next one is figuring out the DAG (and consequently meshlet borders, locking and adaptive simplification)

frank sail
#

I'll work on lumen while you do nanite

wicked notch
#

epic

frank sail
#

games

wicked notch
#

one day we'll merge our powers and combine VSM Lumen and Nanite

#

and then we'll get cease and desists from Epic bleakekw

frank sail
#

have you determined if it's possible to implement a subset of nanite and still observe a benefit

#

i.e., reduce the work for smaller gains

#

because having any auto lod at all is very useful

wicked notch
#

nanite is super scalable yeah

#

already I can decide the simplification factor

frank sail
#

I mean in terms of implementation effort

wicked notch
#

oh

#

I'm going for the simplest possible impl

#

So far it's a bumpy road

#

I think the major hurdles will be due to edge cases

#

don't care yet about those but I'll try to mitigate them when I see one

#

I also realized that I don't understand mesh shading at all agonyfrog

#

as in "what is a vertex index buffer"

#

"for indirection into the vertex buffer allowing vertex reuse"

#

is a very valid answer and the one I'd have given until yesterday

#

I gotta get a deeper understanding on that though to make good™ partitioning

#

this doesn't matter if you just use meshopt for the initial clustering and "split" steps

#

although meshoptimizer is suboptimal for nanite, I confirmed that

#

long meshlets + potentially discontinuous islands are terrible

#

this means that two different meshlets might share two borders

frank sail
#

is that why you were making your own clusteroni

wicked notch
#

which breaks some assumptions

wicked notch
#

I'll ask suslik for help tomorrow

frank sail
#

sucking is the first step to being kinda good at something

wicked notch
#

the two major issues with my own clusterizer right now are that my heuristics suck

#

and that partition size is not guaranteed

wicked notch
wispy spear
#

does it really matter?

wicked notch
#

yes

wispy spear
#

when you merge all tringles into one list, and then just partition by 128 tringles

wicked notch
#

mesh shaders have a strict upper bound

wispy spear
#

and fill up with NANs at the end

wicked notch
frank sail
wicked notch
#

subdividing might be a better one

wispy spear
#

ah

#

that i even understand : )

#

or you smuggle in a little low poly frog mesh

#

in 127 variations, to fill up all possible scenarios

#

when the last meshlet only has 1, 2, 3, 4, 5 etc vertices then the frog127, frog126, frog125 etc mesh goes here 🙂

frank sail
#

would be funny to use little frogs as a debug mesh

wicked notch
#

I do use the good froge

wispy spear
#

the big froge works nicely

wicked notch
#

it's one primitive, it's reasonably complex and it has meshlets of all shapes and kinds

frank sail
#

also that made me think about a possible low poly aesthetic for a game that uses as few vertices as possible to represent an object

wispy spear
#

the 2 vertex castle

#

in the distance

frank sail
#

VK_LINES castle

wispy spear
#

that one big tower which towers over all the other parts of it

#

you keep increasing GL_LINE_WIDTH until its big enough to switch to actual mesh, as you come closer

#

lustri, write your ideas down and then focus on your exams

#

unless you can convince sir brain to hire you out of school before all that

twin bough
#

budget unreal

#

damn

#

great job

wicked notch
#

this be a meshlet

#

how do I find the edges of said meshlet bleakekw

#

do I have to take into account winding order?

wispy spear
#

criver might know?

wicked notch
#

perchance

wicked notch
#

huh

#

maybe border edges can only connect two vertices

#

wait

#

I mean

#

border vertices can only be connected by two edges

#

or maybe 3

#

at most 3

#

can an inner vertex be connected by 3 edges though?

#

yes..

#

hmm

#

oh hold on

#

a boundary edge is only referenced by one primitive

#

so if I encounter the same edge twice when scanning primitives then it's not a boundary edge
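
The rule above can be sketched in C++ roughly like this. It is a minimal sketch, not the code from the project: `packEdge` and `findBoundaryEdges` are made-up names, and the cluster is assumed to be a flat list of 32-bit indices, three per triangle. Because each edge is stored with its smaller index first, winding order does not matter for this test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Pack an undirected edge into one 64-bit key. The smaller index goes in the
// high bits so (a, b) and (b, a) produce the same key.
std::uint64_t packEdge(std::uint32_t a, std::uint32_t b) {
    if (a > b) std::swap(a, b);
    return (static_cast<std::uint64_t>(a) << 32) | b;
}

// Returns the sorted keys of all edges referenced by exactly one triangle.
// `indices` is assumed to hold three vertex indices per triangle.
std::vector<std::uint64_t> findBoundaryEdges(const std::vector<std::uint32_t>& indices) {
    assert(indices.size() % 3 == 0);
    std::unordered_map<std::uint64_t, std::uint32_t> edgeCounts;
    for (std::size_t i = 0; i < indices.size(); i += 3) {
        ++edgeCounts[packEdge(indices[i + 0], indices[i + 1])];
        ++edgeCounts[packEdge(indices[i + 1], indices[i + 2])];
        ++edgeCounts[packEdge(indices[i + 2], indices[i + 0])];
    }
    std::vector<std::uint64_t> boundary;
    for (const auto& [edge, count] : edgeCounts) {
        if (count == 1) boundary.push_back(edge);  // seen once: boundary edge
    }
    std::sort(boundary.begin(), boundary.end());
    return boundary;
}
```

For two triangles `{0, 1, 2}` and `{2, 1, 3}` the shared edge (1, 2) is counted twice and dropped, which leaves the four outer edges.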

#

now if I have this index buffer [0, 1, 2]

#

How do I traverse it?

#

0 -> 1
1 -> 2
2 -> 0?

#

Rather how does the GPU do it

glass sphinx
#

yes

#

wait

#

traverse in what sense

#

why would it need it?

#

it just rasterizes

wicked notch
#

the rasterizer gets a list of vertices

#

let's say it's just a triangle

#

[0, 1, 2] is its index buffer

#

I guess the order of the indices defines the winding order?

glass sphinx
#

yea thats it

wicked notch
#

assuming the vertices stay static

glass sphinx
#

but that is defined by the positions as well

frank sail
#

the winding order is determined by the signed area of the triangle

#

which you can calculate from the determinant of the vertices or sumshit

wicked notch
#

right yeah

#

but I mean

#

the signed area changes if I swap around indices in the index buffer

frank sail
#

it do
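
A small sketch of that point, assuming 2D positions that are already projected: the sign of the area depends on the order in which the index buffer visits the vertices, so swapping two indices flips it. `Vec2` and `signedArea2` are hypothetical names for illustration, and real hardware does this in window space with its own precision rules.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Vec2 { double x, y; };

// Twice the signed area of the triangle whose corners are picked by `tri`.
// Positive means counter-clockwise when y points up.
double signedArea2(const std::array<Vec2, 3>& positions,
                   const std::array<std::uint32_t, 3>& tri) {
    assert(tri[0] < 3 && tri[1] < 3 && tri[2] < 3);
    const Vec2& a = positions[tri[0]];
    const Vec2& b = positions[tri[1]];
    const Vec2& c = positions[tri[2]];
    // 2x2 determinant of the two edge vectors leaving `a`.
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}
```

With positions (0,0), (1,0), (0,1), the index order `{0, 1, 2}` gives a positive value and `{0, 2, 1}` gives the same magnitude with the opposite sign, while a rotation such as `{1, 2, 0}` keeps the sign.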

wicked notch
#

ok I think I can do this maybe

frank sail
#

yes you can

wicked notch
#

where's my hashmap

wicked notch
#

my man Federico was rendering 373 million triangles meshes in 2008 with a fucking 6800 GT and OpenGL 2

wicked notch
#

it works

#

holy shit it works

#

let's fucking go

#

don't mind the uh

#

the uh

#
```cpp
auto clusterEdges = std::vector<std::vector<std::pair<std::uint32_t, std::uint32_t>>>(currentClusters.size());
auto clusterEdgeCounts = std::vector<std::unordered_map<std::pair<std::uint32_t, std::uint32_t>, std::uint32_t>>(currentClusters.size());
```
#

it's ok this is at build time it isn't meant to be performant

#

it's not that I can't code

#

nope, totally unrelated

pale horizon
#

Cache: sayonara...

wicked notch
#

might as well write CacheMiss<CacheMiss<CacheMiss<uint32>>>

pale horizon
#

List<List<Pair<
List<HashMap<...

hallow umbra
#

and a diet coke

frank sail
#

buy one of the X3D cpus for epic cache to minimize the damage of this situation

wicked notch
#

true

#

buy AMD

#

(not sponsored)

#

ok now we can partition

#

and then make a shrimplify and split

#

and then make a dag

#

then runtime LOD selection

#

very easy

wicked notch
#
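```cpp
for (auto index = 0; const auto& cluster : currentClusters) {
    auto clusterEdges = std::vector<std::uint64_t>();
    for (auto k = 0; k < cluster.triangle_count; ++k) {
        const auto basePrimitiveId = cluster.triangle_offset + k * 3;
        // The edge setup was missing from the pasted snippet. `indices` is an assumed
        // name for the index array that `triangle_offset` points into.
        auto edge0 = std::pair<std::uint32_t, std::uint32_t>(indices[basePrimitiveId + 0], indices[basePrimitiveId + 1]);
        auto edge1 = std::pair<std::uint32_t, std::uint32_t>(indices[basePrimitiveId + 1], indices[basePrimitiveId + 2]);
        auto edge2 = std::pair<std::uint32_t, std::uint32_t>(indices[basePrimitiveId + 2], indices[basePrimitiveId + 0]);
        // Order each edge so (a, b) and (b, a) pack to the same 64-bit key.
        if (edge0.first > edge0.second) {
            std::swap(edge0.first, edge0.second);
        }
        if (edge1.first > edge1.second) {
            std::swap(edge1.first, edge1.second);
        }
        if (edge2.first > edge2.second) {
            std::swap(edge2.first, edge2.second);
        }
        clusterEdges.emplace_back(static_cast<std::uint64_t>(edge0.first) << 32 | edge0.second);
        clusterEdges.emplace_back(static_cast<std::uint64_t>(edge1.first) << 32 | edge1.second);
        clusterEdges.emplace_back(static_cast<std::uint64_t>(edge2.first) << 32 | edge2.second);
    }
    // After sorting, duplicates are adjacent: an edge with no equal neighbour is a border edge.
    std::sort(std::execution::par, clusterEdges.begin(), clusterEdges.end());
    auto clusterBorderEdges = std::vector<std::uint64_t>();
    clusterBorderEdges.reserve(clusterEdges.size());
    if (clusterEdges[0] != clusterEdges[1]) {
        clusterBorderEdges.emplace_back(clusterEdges[0]);
    }
    // Visit every interior element, including the second-to-last one.
    for (std::size_t k = 1; k + 1 < clusterEdges.size(); ++k) {
        const auto& previousEdge = clusterEdges[k - 1];
        const auto& edge = clusterEdges[k];
        const auto& nextEdge = clusterEdges[k + 1];
        if (edge != previousEdge && edge != nextEdge) {
            clusterBorderEdges.emplace_back(edge);
        }
    }
    if (clusterEdges[clusterEdges.size() - 2] != clusterEdges[clusterEdges.size() - 1]) {
        clusterBorderEdges.emplace_back(clusterEdges[clusterEdges.size() - 1]);
    }
    ++index;
}
```

I made it less cancer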
#

I guess now that I sort this I can make a sliding window search to look for shared meshlet borders

#

although borders can get pretty big

#

eh it's fine, CPUs are fast KEKW

wicked notch
#

and since I'm already breaking half of the GLTF spec, why not break it even more, I'll define custom attributes to store dual graph topology & stuff

wicked notch
#

hmm I guess edge weights now are the number of shared boundary edges

#

wait hold up

#
```
uniqueEdges = List<HashMap<uint64, uint64>>();
for (index, cluster) in clusters {
  clusterEdges = HashMap<uint64, uint64>();
  for (id, _) in cluster.primitives {
    edges = [
      sort((cluster.triangles[id * 3 + 0], cluster.triangles[id * 3 + 1])),
      sort((cluster.triangles[id * 3 + 1], cluster.triangles[id * 3 + 2])),
      sort((cluster.triangles[id * 3 + 2], cluster.triangles[id * 3 + 0])),
    ];
    for edge in edges {
      ++clusterEdges[(edge.0 << 32) | edge.1];
    }
  }
  // keep only edges seen once: those are the cluster's border edges
  uniqueEdges[index] = clusterEdges
    .erase((edge, count) => count > 1);
}
clusterBorders = HashMap<uint64, List<(uint64, List<uint64>)>>();
for (i, _) in clusters {
  for (j, _) in clusters {
    if i == j {
      continue;
    }
    // reset per neighbour, otherwise edges shared with earlier clusters leak into this one
    sharedEdges = List<uint64>();
    for edge in uniqueEdges[i] {
      if uniqueEdges[j].contains(edge) {
        sharedEdges.push(edge);
      }
    }
    if sharedEdges.isEmpty() {
      continue;
    }
    clusterBorders[i].push((j, sharedEdges));
  }
}
```
#

damn

#

wait no still wrong

#

shit

#

ok now it's good (maybe)

#

ok pseudocode helps

wicked notch
#
```cpp
std::unordered_map<std::uint64_t, std::vector<std::pair<std::uint64_t, std::vector<std::uint64_t>>>>()
```
#

it got worse

wispy spear
#

use using

#

: )

wicked notch
#

That's heresy

wispy spear
#

i like how you exaggerated with the std:: for uint64_t 😄

wicked notch
#

I could've copy pasted my own integer types

#

but I feel like it's too late now bleakekw

#

Anyways I bothered Suslik again

#

I hope he doesn't kill me

wicked notch
#

shit

#

meshopt_simplify doesn't have a way for me to lock edges

#

fffffuck

wicked notch
#

how far is the future zeux

#

it's been two weeks

wispy spear
#

what were you asking? 😄

wicked notch
#

It's another's PR

#

apparently a company is making nanite too bleakekw

frank sail
#

I bet you could get hired at traverse

#

Then implement nanite for them

wicked notch
#

it's a start

wispy spear
#

we soon need a meshlet containing emoji

wicked notch
#

shit

#

the partitions are wrong

#

how tf

#

ok time to go back to balls

#

I might need bigger balls though

#

goddamit

#

it's the disconnected meshlets

#

fuuuuck

#

meshopt_simplify also fails spectacularly to halve the number of triangles

#

damn

#

it's going wrong in so many different places at once froge_bleak

wicked notch
#

absolutely fantastic

#

beautiful even

wicked notch
#

my ability to code dwindles with every second

#

I don't even know why I assumed meshoptimizer was wrong instead of my ridiculously dodgy code

#

meshopt_simplify works so flawlessly it's scary

#

alright

#

no seams/tears

#

let's fucking gooo

#

but suzanne had a bit of a lobotomy KEKW

#

alright moment of truth

#

let's randomize vertex selection

#

those traverse guys are full of shit

#

beautiful unchanged borders

#

we're nearly done boys

#

all I have to do is the goddamn DAG and we win

wicked notch
#

vertex weights are bad somehow

#

I'm having way too much fun

#

this is glorious

wispy spear
#

: )

#

that makes me happy

#

is that froge again?

wicked notch
#

this is just a ball KEKW

wispy spear
#

oukay : )

wicked notch
#

we're inside a ball, looking out

#

it's easier to visualize this way

wispy spear
#

yeah i can see we are inside

wicked notch
#

man this is too good

#

we're almost there boys

#

and then I'll replace every one of your renderers with this

wispy spear
#

all this meshletism is because the gpu has an easier time scheduling work for those smoller islands of verticles?

wicked notch
#

ye

#

also better/easier culling

wispy spear
#

yeah

#

it sounds counter intuitive at first

#

more vertex groups to be aabb or whatever checked

#

than just the original mesh primitives

#

i cant wait to unlock this achievement too hehe

wicked notch
#

I'll delegate Jaker to write documentation for my nanite

#

so that you can go ahead and implement it

wispy spear
#

: D

#

Saky should implement it

wicked notch
#

oh yeah, potrick and saky were working on nanite 2 as well

wispy spear
#

were?

wicked notch
#

saky got stuck writing VSM then KEKW

#

and potrick got the culling brainworm

#

alright this is it for today

#

tomorrow I'll do stable partitioning and cross cluster boundary edge weights

#

and maybe I'll figure out how to hack glTF to store le epic clusters

wispy spear
#

sounds like a plan

wicked notch
#

now it's time to study

#

until I fall asleep

wicked notch
#

hmm maybe the usual limits are causing clusters to be too smol

#

with 128/128 I get sparser but more consistent clusters

#

the usual game of tradeoffs

wicked notch
#

ok 128/128 is gud enough

#

now

#

what do I do

#
  • store everything needed to build the DAG and build it at runtime
  • serialize the DAG and somehow also the vertex/index data referenced by said DAG
#

I feel like the first one is easier

wicked notch
#

friggin METIS is crashing again on edge weights

#

ok time for asan

wicked notch
#

goddamnit

#

this shit is so slow

#
```cpp
for (auto k = 0; k < meshClusters.size(); ++k) {
    auto borderingClusters = std::vector<std::pair<std::uint64_t, std::vector<std::uint64_t>>>();
    for (auto z = 0; z < meshClusters.size(); ++z) {
        if (k == z) {
            continue;
        }
        auto sharedEdges = std::vector<std::uint64_t>();
        for (const auto& edge : uniqueEdges[k]) {
            if (uniqueEdges[z].contains(edge)) {
                sharedEdges.emplace_back(edge);
            }
        }
        if (sharedEdges.empty()) {
            continue;
        }
        borderingClusters.emplace_back(z, std::move(sharedEdges));
    }
    clusterBorders[k] = std::move(borderingClusters);
}
```
#

and no shit

#

O(n^2)

#

is indeed slow KEKW
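
One way around the pairwise scan, sketched under assumptions: invert the relation and map each border edge to the clusters that own it, so every edge is touched a constant number of times on a manifold mesh. This is not what the chat's code does; `findSharedBorders` and `SharedBorders` are made-up names, and each cluster's border edges are assumed to be given as packed 64-bit keys with no duplicates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// For each pair of clusters (lower index first), the border edges they share.
using SharedBorders =
    std::map<std::pair<std::size_t, std::size_t>, std::vector<std::uint64_t>>;

// `boundaryEdges[i]` is assumed to hold the packed border-edge keys of cluster i.
SharedBorders findSharedBorders(const std::vector<std::vector<std::uint64_t>>& boundaryEdges) {
    // Invert the relation: edge key -> clusters that have it on their border.
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> owners;
    for (std::size_t cluster = 0; cluster < boundaryEdges.size(); ++cluster) {
        for (const auto edge : boundaryEdges[cluster]) {
            owners[edge].push_back(cluster);
        }
    }
    SharedBorders borders;
    for (const auto& [edge, clusters] : owners) {
        // On a manifold mesh this list has at most two entries, so the
        // nested loop below runs at most once per edge.
        for (std::size_t a = 0; a < clusters.size(); ++a) {
            for (std::size_t b = a + 1; b < clusters.size(); ++b) {
                borders[{clusters[a], clusters[b]}].push_back(edge);
            }
        }
    }
    return borders;
}
```

The number of shared edges per pair, `borders[{i, j}].size()`, is the kind of value the chat uses as an edge weight between clusters.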

wicked notch
#

Ah my weights are inverted

#

shiet

wispy spear
#

ill bring back the idea of 127 variants of a frog, to fillup the remaining vertices in a meshlet

wicked notch
#

no worries

#

I have fixed everything

#

™

#

ok now I really have to start to think on how to store this shit

#

watching meshlets at different LODs interact perfectly is fun

#

but I want to see them switching at runtime

#

so

#

glTF

#

wonderful format

#

how do I shove meshlet in it

wicked notch
primal shadow
wicked notch
#

epic

#

unfortunately I am going insane

#

so the border issue is irrelevant

#

but the "generating good meshlets" issue is still alive and kicking

frank sail
wicked notch
#

I'm using both METIS and my own graph algos

#

some stuff is just painful to implement KEKW

frank sail
#

ah

wicked notch
#

but it's ok, METIS isn't giving much trouble now that I know what the cryptic shit Karypis wrote in his document means

#

anyways the issue is

#

neither METIS nor I can generate functional meshlets

#

as in, I cannot generate meshlets that stay under the 128 primitive limit

#

I can't enforce that limit

#

further still, meshoptimizer still greatly prefers vertex reuse to spatial locality

primal shadow
wicked notch
#

I have the paper open in my browser but I still have to read it kekkedsadge

#

maybe I'll read it now, draw inspiration from it

#

Sponza is also a ridiculous mesh

frank sail
#

are the big triangles making you down

wicked notch
#

no

#

it's the pillars

#

it's a single mesh

#

so the clusterizer just assumes it's contiguous

#

but it isn't

frank sail
#

is dis old sponza

wicked notch
#

ye

frank sail
wicked notch
#

it's meshoptimizer's

#

but the same issue happens with graph partitioning

#

the absolute funniest thing

#

is that Unreal doesn't give a shit

#

and just clusterizes all of sponza at once

frank sail
#

hmm so make a big soup and then submit the triangles to the clusteroni

wicked notch
#

yeah

#

it's just one primitive

#

which is great, but you still have discontinuous islands in your graph

#

the pillars and the arches share no vertices

#

for some goddamn reason

frank sail
#

so don't put them in the same cluster

#

shrimple as that

wicked notch
#

and if I don't I get bad partitioning later because something that should have been a border, actually isn't

#

bad partitioning = less than ideal partitions

#

which in turn restricts the simplifier from simplifying

#

and in turn the split step cannot actually split anything due to the mesh being too simple

wicked notch
#

ok I figured out how to make the micro index buffer

#

thank you zeux for commenting your code KEKW

#

new approach

wicked notch
#

I think I cracked it

#

ok maybe I did crack it

#

This is unreal's ball

#

This is mine

#

I think this is good

buoyant summit
#

wow looks like your clusters are more rectangular too........

#

instead of a more stretched aspect ratio

wicked notch
#

Unreal generates twice as many clusters as well for some reason

#
```
LogStaticMesh:   Input: 4130 Clusters, 520344 Triangles and 1199795 Vertices
LogStaticMesh:   Output without splits: 4130 Clusters, 520344 Triangles and 1199795 Vertices
LogStaticMesh:   Output with splits: 8233 Clusters, 520344 Triangles and 1202735 Vertices
LogStaticMesh: Material Stats - Unique Materials: 1, Fast Path Clusters: 8233, Slow Path Clusters: 0, 1 Material: 8233, 2 Materials: 0, 3 Materials: 0, At Least 4 Materials: 0
```
wicked notch
#

I gotta look at UE's source to figure out how the hell they're coercing their graph partitioning to stay true to the primitive limit

wicked notch
#

ok they just bisect the graph when it's greater than expected partition size

#

figures

wicked notch
#

ok, random ideas I got while I was studying

#
  1. One DAG per glTF primitive
  2. the DAG only has one root node if the primitive has fewer than 128 triangles, or two leaf nodes and one root node if it has more than 128 but fewer than 255
  3. The simplifier reaching the target goal doesn't matter
  4. The cluster partitions must be at least composed of 4 clusters per partition
  5. Make a better algo for searching connected clusters because O(n^2) where n=numClusters is absolute shit kekkedsadge (it's probably parallelizable)
wispy spear
#

do you have to find all these clusters during runtime????

#

is all offline nonsense right?

wicked notch
#

ye this is all offline

wispy spear
#

then dont put too much brain-cell-lets into this

wicked notch
#

I think

#

maybe I'm approaching this in a wrong way

#

instead of hoping partitioning and clustering will work with generic meshes of a generic triangle count

#

why not make sure everything is a multiple of 128

#

so basically subdivision to make sure a mesh is always at least 128 triangles

#

let me check what happens if I feed unreal engine a single triangle

#

ok unreal doesn't tessellate/displace a cube or a triangle

#

I assume they simply skip all the steps and make the DAG with a single root node

#

yup

#
```
LogStaticMesh: Display: Building static mesh Mesh...
LogStaticMesh: Adjacency [0.00s], tris: 192, UVs 2
LogStaticMesh: Clustering [0.00s]. Ratio: 1.000000
LogStaticMesh: Leaves [0.00s]
LogStaticMesh: Reduce [0.00s]
LogStaticMesh: Fallback 0/1 [0.00s], num tris: 192
LogStaticMesh: ConstrainClusters:
LogStaticMesh:   Input: 3 Clusters, 318 Triangles and 277 Vertices
LogStaticMesh:   Output without splits: 3 Clusters, 318 Triangles and 277 Vertices
LogStaticMesh:   Output with splits: 3 Clusters, 318 Triangles and 277 Vertices
LogStaticMesh: Material Stats - Unique Materials: 1, Fast Path Clusters: 3, Slow Path Clusters: 0, 1 Material: 3, 2 Materials: 0, 3 Materials: 0, At Least 4 Materials: 0
```
#

if a mesh is >128 but <255 then Unreal does some shenanigans

#

I assume this is simple subdivision and not catmull-clark

buoyant summit
#

catmull-clark subdivision would obviously be bad because it changes the surface

#

(as in the set of points)

#

tbh I kinda hate catmull-clark even when using it as an authoring tool

#

would be nice to have subdivision algorithm that as you subdivide tends to where you'd trace your shadows basically

wicked notch
#

yeah that would be nice

buoyant summit
#

the last two lines are me going on a tangent, just to be clear, ignore those

wicked notch
#

no more scuffed shadows on spheres

buoyant summit
#

scuffed shadows on curved surfaces is already a solved problem

wicked notch
#

anyways I'll try looking for some subdivision algorithms/libraries that can guarantee me N output triangles

#

any suggestions?

buoyant summit
#

tbh I'm not sure that can exist

#

for general N

wicked notch
#

yep "guarantee" is a strong word

#

I want something that can at least approach N

buoyant summit
#

you could just split random edges (preferably ones that have big triangles lying on them, so that your triangles tend to be about the same size) until you arrive at N
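
A rough sketch of that idea, with one swapped-in choice: instead of picking random edges weighted by triangle size, it always splits the longest edge at its midpoint, which also pushes triangles toward similar sizes. `splitLongestEdges`, `Vec3` and `distanceSquared` are hypothetical names. The mesh is assumed to be indexed and non-degenerate, with neighbouring triangles sharing vertex indices; duplicated vertices along a seam would leave a T-junction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

float distanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Splits the longest edge at its midpoint until the mesh has at least
// `targetTriangles` triangles. It can overshoot by one, because an interior
// edge is shared by two triangles and both get split. The surface itself
// does not change, since the new vertex lies on an existing edge.
void splitLongestEdges(std::vector<Vec3>& positions,
                       std::vector<std::uint32_t>& indices,
                       std::size_t targetTriangles) {
    assert(indices.size() % 3 == 0);
    if (indices.empty()) return;
    while (indices.size() / 3 < targetTriangles) {
        // Brute-force search for the longest edge (fine for an offline sketch).
        std::uint32_t a = indices[0];
        std::uint32_t b = indices[1];
        float longest = -1.0f;
        for (std::size_t i = 0; i < indices.size(); i += 3) {
            for (std::size_t e = 0; e < 3; ++e) {
                const std::uint32_t p = indices[i + e];
                const std::uint32_t q = indices[i + (e + 1) % 3];
                const float d = distanceSquared(positions[p], positions[q]);
                if (d > longest) { longest = d; a = p; b = q; }
            }
        }
        // Add the midpoint as a new vertex.
        const auto m = static_cast<std::uint32_t>(positions.size());
        const Vec3 pa = positions[a];
        const Vec3 pb = positions[b];
        positions.push_back({(pa.x + pb.x) * 0.5f, (pa.y + pb.y) * 0.5f, (pa.z + pb.z) * 0.5f});
        // Every triangle that uses the edge becomes two triangles with the same winding.
        const std::size_t oldSize = indices.size();
        for (std::size_t i = 0; i < oldSize; i += 3) {
            const std::uint32_t tri[3] = {indices[i], indices[i + 1], indices[i + 2]};
            const bool usesA = tri[0] == a || tri[1] == a || tri[2] == a;
            const bool usesB = tri[0] == b || tri[1] == b || tri[2] == b;
            if (!usesA || !usesB) continue;
            for (std::size_t c = 0; c < 3; ++c) {
                indices[i + c] = (tri[c] == b) ? m : tri[c];    // half that keeps a
                indices.push_back((tri[c] == a) ? m : tri[c]);  // half that keeps b
            }
        }
    }
}
```

Since this only approaches N from below and may land one past it, it matches the "approach N" wording rather than a hard guarantee.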

wicked notch
#

hmm

#

makes sense

#
```
if (primitiveCount % 128 > 32) {
  area = CalculatePrimitiveArea(mesh);
  edges = CalculateWeightedPrimitiveAdjacency(mesh, area);
  subdivided = SubdivideMesh(mesh, edges);
}
```
buoyant summit
#

btw obviously you can subdivide an edge not into two edges but e.g. 3 edges

#

or 5 edges

#

maybe that's worthwhile

wispy spear
wicked notch
#

this is catmull-clark which changes the surface so unfortunately not for me

#

but maybe I can apply this knowledge

primal shadow
#

@wicked notch do you have a reference for calculating tangents from screen space derivatives?

#

Something that can be used with the miktspace normal system bevy already uses

wicked notch
#

Otherwise I now prefer just calculating tangents on the host if they're missing

wicked notch
#

so here's another thing I didn't understand at first

#

you don't use a graph partitioning algorithm to do clusters and hope they don't go over 128

#

you use a graph partitioning algorithm to recursively bisect a graph until the partitions are the right size
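
The control flow of that idea, as a minimal sketch: keep bisecting until every piece fits under the limit. The `bisect` function here is only a stand-in that cuts an index range in the middle. In the chat this step is a 2-way graph partition (METIS) followed by a reorder, which is not reproduced here. `Range`, `bisect` and `recursiveBisect` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of node slots.
struct Range { std::size_t begin, end; };

// Stand-in for the real partitioner call: cuts the range in the middle.
// A real build would pick the split from a 2-way graph partition instead.
std::size_t bisect(const Range& range) {
    return range.begin + (range.end - range.begin) / 2;
}

// Recursively bisects until every leaf holds at most `maxPartitionSize` nodes.
void recursiveBisect(const Range& range, std::size_t maxPartitionSize,
                     std::vector<Range>& leaves) {
    assert(maxPartitionSize > 0);
    if (range.end - range.begin <= maxPartitionSize) {
        leaves.push_back(range);  // small enough: this is a final partition
        return;
    }
    const std::size_t split = bisect(range);
    recursiveBisect({range.begin, split}, maxPartitionSize, leaves);
    recursiveBisect({split, range.end}, maxPartitionSize, leaves);
}
```

With 385 nodes and a limit of 128 this middle-cut stand-in ends up with four leaves of 96 or 97 nodes, which also shows why the split weights matter: an even cut overshoots the number of partitions that 385 nodes really need.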

wicked notch
#

ok graph bisection

#

are Unreal peeps smoking crack or am I dumb

#
```cpp
    real_t PartitionWeights[] = {
        float( TargetNumPartitions / 2 ) / TargetNumPartitions,
        1.0f - float( TargetNumPartitions / 2 ) / TargetNumPartitions
    };
```
frank sail
wicked notch
#

isn't this just [0.5, 0.5]

#

x / 2 / x is just 1/2

frank sail
#

what is the type of TargetNumPartitions

wicked notch
#

oh yeah

frank sail
#

presumably not float

wicked notch
#

le integer division

frank sail
#

in programming, we do special math

wicked notch
#

ok so for a graph with N nodes

#

I gotta do this

#
```
minPartitionSize = 124;
maxPartitionSize = 128;
targetPartitionSize = (minPartitionSize + maxPartitionSize) / 2;
targetPartitionCount = Max<int32>(2, Round(nodesCount / float32(targetPartitionSize)));
partitionWeights = [
  float32(targetPartitionCount / 2) / targetPartitionCount,
  1.0 - (float32(targetPartitionCount / 2) / targetPartitionCount),
];
```
#

and so this way if I have 385 nodes

#

I do
targetPartitionCount = Max<int32>(2, Round(385 / float32(126))); which is just 3
float32(3 / 2) / 3 which is 0.333
1.0 - float32(3 / 2) / 3 which is 0.666
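
The same arithmetic as a small C++ function, to make the integer division explicit. This is a sketch that follows the pseudocode in the chat, not Unreal's actual code, and `bisectionWeights` is a made-up name.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Weights for one bisection step. `targetCount / 2` is an integer division on
// purpose: with an odd target count the two sides get uneven weights.
std::array<float, 2> bisectionWeights(std::int32_t nodeCount,
                                      std::int32_t minPartitionSize,
                                      std::int32_t maxPartitionSize) {
    assert(nodeCount > 0 && minPartitionSize > 0 && minPartitionSize <= maxPartitionSize);
    const std::int32_t targetSize = (minPartitionSize + maxPartitionSize) / 2;
    const std::int32_t targetCount = std::max<std::int32_t>(
        2, static_cast<std::int32_t>(std::lround(nodeCount / static_cast<float>(targetSize))));
    const float left = static_cast<float>(targetCount / 2) / static_cast<float>(targetCount);
    return {left, 1.0f - left};
}
```

For 385 nodes with sizes 124 and 128 the target count is 3, so the weights come out as roughly 1/3 and 2/3. For 252 nodes the target count is 2 and both weights are 0.5.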

#

ok good

#

then what I do is

wicked notch
#
```
BisectGraph(graph) {
  (_, partitions) = Partition(2, graph);
  front = 0;
  back = graph->GetVertexCount() - 1; // last valid index, not one past the end
  swap = [];
  while front <= back {
    while front <= back && partitions[front] == 0 {
      swap[front] = front;
      front++;
    }
    while front <= back && partitions[back] == 1 {
      swap[back] = back;
      back--;
    }
    if front < back {
      swap[front] = back;
      swap[back] = front;
      front++;
      back--;
    }
  }
  // everything before `split` is in partition 0, everything from it on is in partition 1
  split = front;
  partitionSize = [split, graph->GetVertexCount() - split];
  // make new graphs if partition size still too big
}

RecursiveBisectGraph(graph) {
  output = BisectGraph(graph);
  if output[0] && output[1] {
    RecursiveBisectGraph(output[0]);
    RecursiveBisectGraph(output[1]);
  }
}
```
wicked notch
#

yep

wispy spear
#

and 1.0f / 3 is 0.33

wicked notch
#

the point is to have partition weights be either 0.333, 0.666 (or more) or 0.5, 0.5 (or less)

wispy spear
#

ok, looks like you just didnt update your calculation there : )

wicked notch
#

I hate the fact that I've gotten used to reading unreal's source bleakekw

pale horizon
#

Don't say it to Tim to not be thrown into his CBT dungeon

twin bough
#

Inb4 lvstri gets poached by epic

pale horizon
#

Inb4 lvstri is actually Tim in disguise KEKW

wispy spear
#

inb4 lvstri works at epic already, and uses "exams" as excuses for "i need to finish implementing nanonite 3.0 with my father brian or he will shoe me again"

pale horizon
#

Same vibes

wicked notch
#

I bisected a graph

#

and I took an entire friggin hour to understand whatever the hell I was doing

wicked notch
frank sail
#

but ye did it lad frogapprove

wicked notch
#

this is so unintuitive it's crazy

wicked notch
#

I also gotta remap the remap

#

beautiful

wicked notch
#

unreal does a radix sort of some kind

#

and then hashes the triangle IDs with the distance from their center as the hash

#

wtf

wicked notch
#

ahh yep

#

as usual, more problems came up!

#

maybe if I allow looser partitioning I can solve this easily

wicked notch
#

nope

#

sigh

wicked notch
#

Ah I see

#

it's fucking foliage

#

amazing

#

the offending things are always foliage because they have the most disjoint meshes

#

(and some windows, because they are the same mesh for some goddamn reason, so their triangles are disjoint as well)

#

I could maybe tessellate

wispy spear
#

are you doing that per primitive?

wicked notch
#

ye

#

oh yeah another thing I could do is per tringle range materials

#

like unreal does

wispy spear
#

crazy that this is so hard

wicked notch
#

I can't even see what unreal does with bistro because it can't import bistro kekkedsadge

#

it literally crashes while building nanite meshes

#

anyways, brain fog is really starting to kick in so I'll leave this for past 16th me

wispy spear
#

hmm debugging unreal is not an option neh?

wicked notch
#

I'll try tomorrow

#

but unreal is just pain

#

it takes 10 minutes to import bistro when compiling unreal in release mode

#

one order of magnitude more in debug mode

pale horizon
#

EPIC, HIRE THIS MAN KEKW

wicked notch
#

brian already solved foliage kekkedsadge

wispy spear
#

time to make an unreal without all the clutter and baggage from the past : >

#

to be able to import bistro hassle free

#

to be able to implement VSM 2.0 and nanokatzeite 3.0

wicked notch
#

ok I solved it

#

with a garbage solution

#

(even more bisections)

#

holy shit it's garbage

#

ok subdivision is literally a requirement

#

TODO: make subdivision

wispy spear
#

let delaunay and worley come to you in your dreams

wicked notch
#

OpenSubdiv is covered by the Apache license, and is free to use for commercial or non-commercial use. This is the same code that Pixar uses internally for animated film production. Our intent is to encourage high performance accurate subdiv drawing by giving away the "good stuff".

#

I didn't know Pixar was this chad

primal shadow
#

@wicked notch can it ever happen that a meshlet rendered in the first pass (was visible last frame and not frustum culled this frame) won't be visible after performing occlusion culling in the second pass?

wicked notch
#

no, the second pass only cares about what wasn't visible before

primal shadow
#

hmm ok, so I need to find why things are breaking...

#

oh wait lol

#

I'm reusing the previous occlusion buffer for the current, but never clearing it 😛

distant lodge
#

OpenSubdiv is pretty nice iirc, I believe it's what blender uses for at least one of its subdivision modifiers

#

I recall cloning it a while back for that reason

primal shadow
#

Fixed bugs, got a lot more performance back lol

#

I was accidentally reusing a buffer without clearing it, so once a meshlet became visible it would never become invisible again, leading to the first pass basically rendering every meshlet every time

#

And then for shadow views, I didn't realize the frustum was not setup to be uploaded to the GPU, so culling wasn't working at all

#

So I fixed those bugs and now it's much faster lmao

wicked notch
#

alright, opensubdiv is nice but it's slow as hell

#

before I jump into opensubdiv's gpu backends I'd want to try MT

wicked notch
#

this worked incredibly well though

#

holy shit

primal shadow
#

MT?

wicked notch
#

multithreading

#

btw I now understand what Jensen meant by "the more you buy the more you save"

#

the more you draw the more perf you have

#

that's right, the more I tessellate the better the clusterizer does

frank sail
#

@wicked notch how do you cope with buffers in Vulkan

#

particularly updating them

#

I sense three usages:

  1. Update whole thing every frame
  2. Update occasionally
  3. Never update from CPU (except maybe clearing it with the clear command)
#

Anyways 1 and 3 are easy to solve in a vacuum

wicked notch
#

my public API for buffers is kinda barebones at the moment, I just have this

```cpp
auto Write(const T& value, uint64 offset = 0) noexcept -> void;
auto Write(std::span<const T> values, uint64 offset = 0) noexcept -> void;
auto Write(const void* data, uint64 size, uint64 offset) noexcept -> void;
```
frank sail
#

is this buffered

wicked notch
#

no unbuffered

frank sail
#

alright so you manually buffer it

wicked notch
#

yeah

frank sail
#

🥖

wicked notch
#

I think it may be worth to have something like BufferedTypedBuffer<T> or something

#

even though the name is garbage

frank sail
#

The problem is coming up with a unified abstraction for these uses

#

But it's probably not that hard, just inherit fam

wicked notch
#

yep

#

don't inherit too much otherwise you end up like java

#

new InputStreamReader(new BufferedInputReader(new StreamAdapter(new FuckThisShit())))

frank sail
#

I could template too but then everything is now in the header bleakekw

#

Btw I think devsh handles case 2 by using an upload "pool"

#

That is the rarest case tbh and I think I only do it for scene geometry updates

#

I suppose I could hack it by treating those as N-buffered, but rip memory

wicked notch
#

tbh unless you're moving lots of MiB of data around per frame, it's probably fine to just write it all

#

or at least, update sparsely without buffering

#

for everything else I think the staging pool that devsh presents is quite good

frank sail
#

hmm I better get crackin on this

pale horizon
cold sky
#

my thing is only more optimal when you do 64mb one frame, and 1mb another

frank sail
#

yeah there's nothing you can do when you write the whole thing every frame

cold sky
distant lodge
#

devsh style upload pools are nice but I still n-buffer stuff I write out every frame
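
For the "write the whole thing every frame" case, N-buffering usually comes down to giving each frame in flight its own slice of one larger allocation. A minimal sketch of the offset arithmetic, assuming a hypothetical `SliceForFrame` helper and a caller-supplied alignment (neither is from the Iris codebase or from devsh's pool):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types, for illustration only.
struct BufferedSlice {
    std::uint64_t offset; // byte offset of this frame's slice
    std::uint64_t size;   // usable bytes in the slice
};

constexpr std::uint64_t AlignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Picks the slice that frame `frameIndex` may write to. With N frames in
// flight, slice (frameIndex % N) is no longer read by the GPU once the CPU
// has waited for frame (frameIndex - N) to finish.
BufferedSlice SliceForFrame(std::uint64_t frameIndex, std::uint32_t framesInFlight,
                            std::uint64_t sliceSize, std::uint64_t alignment) {
    assert(framesInFlight > 0 && alignment > 0);
    const std::uint64_t stride = AlignUp(sliceSize, alignment);
    return {(frameIndex % framesInFlight) * stride, sliceSize};
}
```

The backing buffer then has to be `framesInFlight * AlignUp(sliceSize, alignment)` bytes, which is the "rip memory" cost mentioned earlier.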

#

I mainly use them for staging mesh/texture uploads

cold sky
#

sooon depri will give me "Even More Power" (TM)

#

hue hue hue hue hue

frank sail
#

devsh palpatine incident

distant lodge
#

unlimited buffering

cold sky
#

well he's rigging the TimelineSemaphore Deferred std::function<> to a Pool Allocator

#

for dem bindless descriptor sets

distant lodge
#

that's what I do for my fixed size upload pool essentially but I don't see where bindless comes in

cold sky
#

so I can allocate/deallocate

cold sky
#

when I'm done using a texture and want to mark a descriptor array item as ready for overwrite, I want to do free_addr(slot) right?

#

but I probably want to latch that on the semaphore value that the last frame that uses said texture will signal

distant lodge
#

what I do for that is I have FIF implicit linked freelists and append them to the tail of the global free list

cold sky
#

yea, but that dumb

#

it kinda implies you only have one TS

distant lodge
#

so basically I have a std::vector<BindlessHandle> that's all live and dead handles, then BindlessHandle m_waitingFrees[FRAMES_IN_FLIGHT]; and BindlessHandle m_freeListHead/Tail;

#

yeah but you can easily extend this to work with more TS if you get rid of the FIF

cold sky
#

eeeeh

#

trust me I thought about it

#

it gets suuuuper messy with multi-threading

#

also I don't just do free lists, I do arbitrary functors

distant lodge
#

just get better locks

cold sky
#

it just so happens that 99% of the time that functor is a free/deallocate functor

distant lodge
#

the functors do throw a wrench in it yeah but with update-after-bind and whatnot the only thing that's in my critical section is disturbing the index list for alloc/free

#

I use those fancy 1 byte webkit style locks
cold sky
distant lodge
#

and yielding switches fibers instead of blocking the thread

#

so just locking is no big deal for me

cold sky
#

and you also don't want to run the functor from some unexpected thread/actor

#

this is why my latch lists are partitioned per-semaphore-per-resource

#

i.e. for the same semaphore Down Streaming Buffer doesn't keep its latched frees interleaved with Up Streaming Buffer and MegaDescriptorSet latched frees

#

We don't have a master-cleaner/submitter/waiter thread, so sticking the deferred events in the semaphore makes no sense

#

also you need to recognise those systems for what they are == garbage collection

cold sky
#

I dont want to be running data-download consumption host callbacks when I just want to poll if I can free a single descriptor when I allocate a slot, all because I stick all my events on the same semaphore's queue

#

if I slapped all events in the same queue, I'd be susceptible to random pauses like that

distant lodge
#

I just give stuff like that a dedicated update() method, though I guess you'd still be susceptible to the pauses simply from the fact you contend the same lock as the GC cycle

cold sky
#

its also kinda important when the deferred event is a free and the resource is not thread-safe

#

at least my event queues live in the resource, so if I externally synchronise access to the resource, I wont get any nasty surprises like unsynchronized callback execution

cold sky
#

thats a free latched with a data consumption callback when you're moving memory GPU -> Host

#

so you can actually download a 2GB vkBuffer through a 64MB HOST_VISIBLE

#

in lots of little submits of 1 copy command

#

and every single time you overflow (can't allocate) the free callbacks will get run until a timeout

wicked notch
#

this is a bigger brainworm than I'd recommend for safe consumption

#

the epic staging pool is good but going farther than that with this stuff makes me nervous

distant lodge
#

I see that it functions like an upload buffer but the opposite way around essentially

#

but I also think having thread safety issues with the output of your intermediate data download functors is a skill issue

distant lodge
cold sky
frank sail
#

glBufferSubData

distant lodge
#

mine aren't stage 5 brainworm yet and don't do the crazy stuff devsh's do like converting your image format live

cold sky
#

cause when you want to push 2GB of data to Device from Host through a 64MB buffer, you just need to wait for Device to finish copying the previous chunk of staging buf to destination before you overwrite it (the staging)

#

download buffer is faaar weirder

#

cause Host needs to do shit with the data in the 64MB mapped staging BEFORE you free it (and device overwrites with a new chunk)

#

so you need a callback that runs AFTER semaphore signal and BEFORE the free, which you need to perform to make progress in the copy-submit loop

distant lodge
#

right ok I'm starting to see

cold sky
#

in the upstreaming direction you already have the 2GB Host source and 2GB Device destination laying about

#

and if you don't, you'll naturally make multiple calls of exactly the chunk size that the Host produces at a time

distant lodge
#

so with multithreading it, on top of everything else, you have to take care that the execution time of the free functors doesn't bleed into the execution time of enqueueing

cold sky
#

I can explicitly force all ready functors to run with a cull_frees()

#

or just block XD

#

while (m_downStreamingBuffer->cull_frees()) {}

#

😛

#

generally speaking what the buffer will do is try to run 1 or 2 functors every time you allocate

#

via a poll()

#

cause otherwise your allocator runs out of memory and you get faux fragmentation

#

ofc if it can't service your request, then it will do a wait instead of a poll

#

the allocate call takes a timeout
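
The "latch a free on a semaphore value, then poll a couple of functors on every allocate" idea can be sketched without any Vulkan at all. This is a minimal illustration under stated assumptions: `DeferredFreeQueue`, `Latch` and `Poll` are hypothetical names, one queue is owned by one resource and is externally synchronised, and entries are latched in non-decreasing semaphore-value order. It is not devsh's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical per-resource queue of deferred functors (usually frees).
class DeferredFreeQueue {
public:
    // Remember `onComplete` until the timeline semaphore reaches `signalValue`.
    void Latch(std::uint64_t signalValue, std::function<void()> onComplete) {
        assert(pending_.empty() || pending_.back().value <= signalValue);
        pending_.push_back({signalValue, std::move(onComplete)});
    }

    // Run at most `maxFunctors` entries whose value the GPU has already
    // signalled. `completedValue` would come from querying the semaphore's
    // counter. Returns how many functors ran.
    std::size_t Poll(std::uint64_t completedValue, std::size_t maxFunctors) {
        std::size_t ran = 0;
        while (ran < maxFunctors && !pending_.empty() &&
               pending_.front().value <= completedValue) {
            std::function<void()> functor = std::move(pending_.front().functor);
            pending_.pop_front();
            functor(); // e.g. return a staging range or descriptor slot
            ++ran;
        }
        return ran;
    }

private:
    struct Entry {
        std::uint64_t value;
        std::function<void()> functor;
    };
    std::deque<Entry> pending_;
};
```

An allocator built on this would call `Poll(counter, 2)` at the top of every allocation, and only fall back to a blocking wait (with the caller's timeout) when the request still cannot be serviced.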

distant lodge
#

the way I'd do it is probably just to have a dedicated update() method that calls a parallel scatter of the functors with a gather function pending on it to free the indices when it's done, if I was assuming that running them synchronously would affect execution elsewhere

#

and if there's no space to allocate, just enqueue your functor into the not ready list directly

cold sky
#

after upgrading to timeline semas

#

I literally never have to run that

#

unless I want the functors to run "earlier than"

cold sky
distant lodge
#

so it's valid usage to allocate staging buffers that get deleted when you're done staging?

cold sky
#

yep

#

it's just that the destructor might stall you

#

you'll deadlock if you havent submitted the copies yet though

distant lodge
#

funky, I'd much rather have to remember to call the update method lol

#

doesn't sound nearly worth it for surprise sync points

cold sky
#

its totally worth it

#

cause we can write code that does not give shits about overflows

#

like a faux-immediate mode CAD renderer

#

that can attempt to draw 400MB of CPU produced geometry through a 64MB ReBAR buffer

distant lodge
#

mine can handle the overflows, I just need to have an uploadPool.update() in my update loop somewhere

cold sky
#

yeah and we don't

#

you just check the condition for "would be overflow"

#

and do a "hidden transparent submit"

#

which signals the semaphore value that you latched all the resource frees from waaaay before you knew there'd be an overflow

#

then you wait on the semaphore

#

reset the commandbuffer and begin it again

#

go back to business as usual with the rest of the code
and while its trying to allocate all those resources, it unknowingly releases resources from deferred functors

distant lodge
#

I do all that except replace waiting on the semaphore for enqueueing the functor

cold sky
#

you need to wait on the semaphore cause you've run out of space

distant lodge
#

I opted to structure specifically around never waiting on anything

cold sky
#

literally no forward progress can be made

distant lodge
#

yeah that's why I have the update function though, it does the same TS check you'd do when enqueueing another write job, except if nothing is ready I don't block any threads

#

which is much better for me because fibers = avoid OS blocks

cold sky
#

potatoe/tomatoe

#

you still need to suspend execution of the "Immediate Draw" routine

#

either by waiting, switching or yielding

distant lodge
#

yeah that's true, luckily I don't have to port GDI/oldGL code lol

cold sky
#

the thing is, with your system I'd have to spam update() literally everywhere

#

so I can make progress

#

or whenever an allocation fails

#

my update() is just rolled into the allocation routine implicitly

distant lodge
#

yeah I guess the major downside is you have to architect everything around being async-friendly

cold sky
#

wdym everything?

distant lodge
#

everything downstream of your uploads and downloads I mean

cold sky
#

nope

distant lodge
#

I meant with my system

#

if you block on stuff you don't

cold sky
#

the whole code outside of the utilities thinks the commandbuffer never got reset

#

it goes into the function begun

#

and comes out begun as well

#

the only way it can know from the outside that there was a hidden submit is because the nextSemaphoreWait.value has incremented

cold sky
#

right now

#

to overlap Host and Device better

#

maybe I should have a lambda or some condition which causes artificial allocation failure / overflow

#

i.e. if I have 128MB staging buffer, so the host doesn't suballocate the whole 128MB, write it, submit and block

#

but rather tries to request 32 or 64mb at a time max

#

and if there's only 96 or 64mb free, treat it as an overflow

#

then I could have it block on a much older submit

#

eeeeh

#

but then I'd need multiple commandbuffers to round-robin

#

cause I can't reset a pending one

#

hmm

#

ok I might actually implement this as an improvement

#

aside from the intendedNextSubmit where the last cmdbuf is the resettable scratch, I could have N of them going round robin

#

that will be stage 69 brainworm

#

then if you have the right sized N, the actual function with overflows might actually never block (if Host is slow enough on the callbacks or data filling)

distant lodge
#

honestly that's a pretty neat idea, would need some tuning to be worthwhile though

cold sky
#

I'm not really hungry for perf

#

Just usability and correctness

wicked notch
#

back to nanite

#

I have sent the project in

#

alright so tessellation is done (it saves my ass so much you have no idea, at least for the initial clustering)

#

strict graph partitioning is done

#

simplification is done

#

what's left is

  • figure out how I'm gonna store this shit in glTF
  • building the DAG
frank sail
#

LVSTRI_nanite_at_home

wicked notch
#

we're almost done boys

#

just a little more and we'll have flawless LODs

#

I accept suggestions for storing meshletisms in glTF

#

like does one mesh = one LOD group?

#

how do I express the parent - child relationship of the DAG

frank sail
#

via edges

wicked notch
#

yeah but in gltf

frank sail
#

hehe idk

wicked notch
#

maybe it's just better if I make a custom format

frank sail
#

hmm copy how nodes do it?

#

you could store it as a custom binary blob in the gltf

wicked notch
#

my DAG is gonna be CSR

#

so it's just two arrays

frank sail
#

CSR?

wicked notch
#

compressed sparse row

frank sail
#

wat dat

wicked notch
#

like a graph that connects vertex 0 to 2 and 3, vertex 1 to 0 and 2, and vertex 2 to 0, 1 and 3 is just written like this
[0, 2, 4, 7]
[2, 3, 0, 2, 0, 1, 3]

frank sail
#

ah noice

#

so is each row a layer of the dag

#

i.e. the first row is just root nodes

wicked notch
#

the first array is offsets into the second array

#

the second array is edges

frank sail
#

o

wicked notch
#

I shall explain better

#

vertex 0 => look up range in the first array: start = range[0]; end = range[1], edges connecting vertex 0 are in the second array starting at 0 and ending at 2 (excluded)
vertex 1 => look up range in the first array: start = range[1]; end = range[2], edges connecting vertex 1 are in the second array starting at 2 and ending at 4 (excluded)
...
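
The range lookup described here can be sketched in a few lines. `CsrGraph` and `Neighbours` are hypothetical names for illustration; the layout is the two-array one described in the chat (offsets into a flat edge list).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container, not taken from the Iris codebase.
struct CsrGraph {
    std::vector<std::uint32_t> offsets; // one entry per vertex, plus one
    std::vector<std::uint32_t> edges;   // all adjacency lists, concatenated

    // Lightweight view over one vertex's slice of `edges`.
    struct Range {
        const std::uint32_t* first;
        const std::uint32_t* last;
        const std::uint32_t* begin() const { return first; }
        const std::uint32_t* end() const { return last; }
        std::size_t size() const { return static_cast<std::size_t>(last - first); }
    };

    // Neighbours of `vertex` live in edges[offsets[vertex] .. offsets[vertex + 1]).
    Range Neighbours(std::uint32_t vertex) const {
        assert(static_cast<std::size_t>(vertex) + 1 < offsets.size());
        return {edges.data() + offsets[vertex], edges.data() + offsets[vertex + 1]};
    }
};
```

For the graph in the example (0 connects to 2 and 3, 1 to 0 and 2, 2 to 0, 1 and 3), `Neighbours(1)` walks the two entries stored at positions 2 and 3 of the edge array.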

frank sail
#

interesting

#

I see they use this for sparse matrices too

#

btw how do you select cuts at runtime kekkedsadge

primal shadow
wicked notch
#

process all leaf clusters, if their parent's error is too big but the children are good, add them to the list, otherwise add the parent to another list for further processing
repeat until all clusters have been processed
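
A CPU-side sketch of that bottom-up selection, under simplifying assumptions that are not from the chat: each cluster has a single parent index (a real Nanite-style DAG groups clusters and can have several), `error` is already projected to screen space, and error never decreases from child to parent. `SelectCut` and `kNoParent` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::uint32_t kNoParent = 0xFFFFFFFFu;

struct Cluster {
    float error;          // projected error (assumed precomputed)
    std::uint32_t parent; // kNoParent for a root
};

// Start at the leaves. A cluster is drawn when its parent would be too
// coarse; otherwise the parent is queued for the next pass.
std::vector<std::uint32_t> SelectCut(const std::vector<Cluster>& clusters,
                                     const std::vector<std::uint32_t>& leaves,
                                     float threshold) {
    std::vector<std::uint32_t> selected;
    std::vector<std::uint32_t> current = leaves;
    std::vector<bool> queued(clusters.size(), false);
    for (std::uint32_t leaf : leaves) {
        assert(leaf < clusters.size());
        queued[leaf] = true;
    }
    while (!current.empty()) {
        std::vector<std::uint32_t> next;
        for (std::uint32_t index : current) {
            const Cluster& cluster = clusters[index];
            const bool parentTooCoarse =
                cluster.parent == kNoParent ||
                clusters[cluster.parent].error > threshold;
            if (parentTooCoarse) {
                selected.push_back(index);       // this cluster is on the cut
            } else if (!queued[cluster.parent]) {
                queued[cluster.parent] = true;   // several children share a parent
                next.push_back(cluster.parent);
            }
        }
        current = std::move(next);
    }
    return selected;
}
```

Because the decision only looks at the parent's error, all siblings agree on whether to stay or defer to the parent, which is what keeps the cut free of overlaps in this simplified form.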

primal shadow
#

Also @wicked notch when you're done please make a write up of the preprocessing algorithm and save me a lot of future pain 😄

#

And tbh how you structured the dispatches over meshlets/instances would be good to compare how I did it against, but less important.

wicked notch
#

remember that I use mesh shaders

primal shadow
#

Oh you do, nvm. Well actually no, I'd still like to know

wicked notch
#

I use your same approach btw if I recall

primal shadow
#

Doing a dispatch with one workgroup per meshlet takes like 0.1ms per 2^16 meshlets, which doesn't scale great :((

#

To write out the index buffer

wicked notch
#
```cpp
struct Meshlet {
  uint32 IndexOffset;
  uint32 IndexCount;
  uint32 PrimitiveOffset;
  uint32 PrimitiveCount;
};

struct MeshletInstance {
  uint32 MeshletIndex;
  uint32 TransformIndex;
  uint32 MaterialIndex;
  // other per instance data
};
```
wicked notch
primal shadow
#

No they're separate

wicked notch
#

otherwise idk, perchance check what nsight has to say

primal shadow
#

Which improved things a lot, but it's still a bit expensive

primal shadow
frank sail
wicked notch
wicked notch
#

ask the wgpu lads to up the limit KEKW

#

2^16 as workgroup limit is kinda garbage

frank sail
primal shadow
#

But back on topic, please do a writeup of the preprocessing 🙂

#

When you finish*

frank sail
#

Lowest common denominator API bleakekw

wicked notch
#

(it isn't)

#

it's just the edge cases

#

they're literally limitless

#

infinite edge cases

primal shadow
frank sail
primal shadow
frank sail
#

2^16 * 128 threads is still like 8 million though

primal shadow
#

It's 1 workgroup per meshlet

wicked notch
#

i guess you can dispatch more in the second dimension

#

should work just fine tbh
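
Spilling into the second dimension could look like the sketch below. It assumes a per-dimension workgroup limit of 65535 (to my knowledge WebGPU's default `maxComputeWorkgroupsPerDimension`; treat the exact number as an assumption and query the device). `SplitDispatch` and `FlattenWorkgroupId` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

struct DispatchSize {
    std::uint32_t x;
    std::uint32_t y;
};

// Splits a 1D workgroup count into an x * y grid with both sides under the
// limit. x * y may exceed the requested count, so the shader must bounds-check.
DispatchSize SplitDispatch(std::uint32_t workgroupCount,
                           std::uint32_t maxPerDimension = 65535) {
    assert(maxPerDimension > 0);
    const std::uint64_t count = workgroupCount;
    const std::uint64_t limit = maxPerDimension;
    assert(count <= limit * limit); // otherwise a third dimension is needed
    if (count <= limit) {
        return {workgroupCount, 1u};
    }
    const std::uint64_t y = (count + limit - 1) / limit; // rows needed
    const std::uint64_t x = (count + y - 1) / y;         // balanced row width
    return {static_cast<std::uint32_t>(x), static_cast<std::uint32_t>(y)};
}

// Mirrors what the shader would compute from its workgroup id; any result
// >= the real meshlet count should exit early.
std::uint32_t FlattenWorkgroupId(std::uint32_t x, std::uint32_t y,
                                 std::uint32_t dispatchWidth) {
    return y * dispatchWidth + x;
}
```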

primal shadow
#

Hmm, perhaps? Let me check what the max overall dimensions are.

#

That's a good idea if it's possible though

wicked notch
#

btw @frank sail can you link again that thing about persistent threads in kompute

frank sail
#

Uhh

#

Was it a paper

wicked notch
#

my froge brain does not allow me to remember events past 2 weeks old

faint crane
#

“It was revealed to me in a dream.”

wicked notch
#

should've just googled smh

#

who's gonna tell LVSTRI

#

oh yeah this is the part that requires UB to work

#

fun

#

wait hold up

delicate rain
#

By UB do you mean "Unreal 5 Behavior" ?

wicked notch
#

ight so

#

MPMC queues on the gpu

#

what's the exit condition

#

and can I just atomicAdd(taskCount, -1)

#

nope

#

alright

primal shadow
#

So uhh trying to spawn 40x40 Stanford bunnies leads to
Buffer binding 6 range 2905695912 exceeds max_*_buffer_binding_size limit 2147483648

#

That's the index buffer

#

I simply can't bind that big of a buffer 😭

#

Idk what to do about that

wicked notch
#

uh

#

BDA?

#

you have that right?

primal shadow
#

Device pointers? No lol

wicked notch
#

damn

#

time to write another wgpu complaint then

primal shadow
#

Already done... I opened an issue a few weeks ago

wicked notch
#

thank god for the wayback machine

#
```glsl
void main() {
  while (true) {
    const uint currentTaskIndex = atomicAdd(completedTaskCount, 1);
    if (currentTaskIndex > maxTasks) {
      return;
    }
    const Cluster currentCluster = clusters[taskIndirection[currentTaskIndex]];
    const Cluster parentCluster = clusters[currentCluster.parent];
    if (ProjectError(currentCluster) <= threshold && ProjectError(parentCluster) > threshold) {
      ProcessCluster(parentCluster);
    } else {
      ScheduleWork(parentCluster);
      atomicAdd(completedTaskCount, -1);
    }
  }
}
```
#

I guess idk

wicked notch
#

this doesn't say when all clusters have been processed though

#

ye this requires more thinking

primal shadow
#

According to some testing, my renderer performs worse than regular CPU-driven draw calls...

#

and has serious memory usage issues having to have a giant index buffer, it's basically unusable for anything practical :/

frank sail
#

why does the index buffer need to be giant

#

bigger than usual?

primal shadow
#

Feels like all my work was wasted unless I go try and add atomic image support and u64s to wgpu/naga

buoyant summit
#

wgpu/naga moment,,,

primal shadow
#

I need one giant index buffer that stores a u32 per vertex of all possible triangles in the scene

#

and just binding that large of a buffer is a problem, I reach the max bindable limit...

#

Buffer binding 6 range 2905695912 exceeds max_*_buffer_binding_size limit 2147483648

frank sail
primal shadow
#

And that's like nearly 3gb of just an index buffer

frank sail
#

ah

primal shadow
#

Because I have to allocate it per triangle, regardless of instancing

frank sail
#

ok I see

primal shadow
#

Same with all the extra per-meshlet data, to a much lesser extent

frank sail
#

surely there is a way you can reuse indices

#

I bet more indirection would do the trick

primal shadow
#

I don't see any way. I do a single draw_indirect() and encode the meshlet ID + triangle ID into each index, and then the vertex shader takes the vertex index (the index I wrote) and extracts the meshlet ID + triangle ID for it to get the vertex for.

#

or not quite meshlet ID + triangle ID, meshlet ID + meshlet index or something, idr exactly

frank sail
#

does webgpu have MDI?

primal shadow
#

It does. Performed extremely poorly when I tested doing one draw per meshlet though.

#

From what I can tell, it's either mesh shaders, or software raster

frank sail
#

yeah one subdraw per meshlet is not ideal

#

but maybe you can group them or use larger meshlets

#

or geometry shaders bleakekw (jk)

primal shadow
#

I think I'll at least try larger meshlets

#

I'm doing 64x64 meshlets, maybe 64x128 performs better

frank sail
#

what are the two numbers

#

i forgor

#

vertices x triangles?

primal shadow
#

yeah

frank sail
#

maybe doing way bigger meshlets would make mdi+instancing viable. I've seen one renderer use 1024 triangle meshlets

#

Though that was ray tracing

wicked notch
#

btw I have an idea for software meshletisms

primal shadow
wicked notch
#

@primal shadow you up?

primal shadow
wicked notch
#

ah rip

#

do you remember how much time it takes for you to rasterize bistro

#

both generating the index buffer and rasterizing the visbuffer

#

it doesn't have to be accurate

primal shadow
#

I never tested on bistro because I don't have a good system for converting whole scenes yet

wicked notch
#

ight, we'll test later with one of your scenes

#

but you can technically cut your memory consumption by a factor of 3 if you accept unindexed rendering

#

without degenerate tringles

primal shadow
#

Err, how?

#

I don't do indexed rendering as is really, the index buffer just encodes the meshlet and triangle data for the vertex shader to load

wicked notch
#

by doing this

```glsl
if (gl_LocalInvocationID.x == 0) {
    const uint currentIndexCount = atomicAdd(g_meshletDrawIndirectBuffer.VertexCount, meshlet.PrimitiveCount * 3);
    s_meshletBasePrimitive = currentIndexCount / 3;
}
barrier();

const uint currentPrimitiveId = gl_LocalInvocationID.x;
if (currentPrimitiveId < meshlet.PrimitiveCount) {
    g_meshletVisiblePrimitiveBuffer[s_meshletBasePrimitive + currentPrimitiveId] = meshletInstanceIndex << 7 | currentPrimitiveId;
}
```
#

And this

```glsl
const uint visiblePrimitiveIndex = gl_VertexIndex / 3;
const uint visiblePrimitive = g_meshletVisiblePrimitiveBuffer[visiblePrimitiveIndex];
const uint meshletInstanceIndex = visiblePrimitive >> 7u;
const uint primitiveId = visiblePrimitive & 0x7fu;
const uint primitiveCycle = gl_VertexIndex % 3;
const uint primitiveIndex = uint(g_meshletPrimitiveBuffer[meshlet.PrimitiveOffset + primitiveId * 3 + primitiveCycle]);
const uint vertexIndex = g_meshletIndexBuffer[meshlet.IndexOffset + primitiveIndex];
const vec3 position = g_meshletPositionBuffer[meshlet.VertexOffset + vertexIndex];
```
primal shadow
#

That's what I currently have basically, no? You still have an index buffer of size triangle_count * 3

wicked notch
#

look closely

#

I never mul by 3 when indexing the "index buffer" (aka g_meshletVisiblePrimitiveBuffer)

#

it's sized meshletCount * triangleCount

#

or well, meshletInstanceCount * triangleCount

primal shadow
#

Oh so you're doing an indirect draw of vertex count size still

wicked notch
#

the size of the draw doesn't change yeah

primal shadow
#

But putting the triangle IDs in a separate buffer and having the vertices load them

#

Rather than the index buffer

wicked notch
#

you could also load the primitive indices directly here

#

it doesn't really matter

primal shadow
#

Sensible yeah. Still a lot of data, but less so.

#

Let me try that later, thanks

wicked notch
#

it's exactly 3 times less memory
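
The saving comes from storing one packed `uint` per visible triangle instead of one per vertex. A small CPU-side mirror of the `meshletInstanceIndex << 7 | primitiveId` packing used in the shader snippets, plus the buffer-size arithmetic; the function names are hypothetical, and the 7-bit split assumes at most 128 triangles per meshlet:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kPrimitiveBits = 7;                              // up to 128 triangles
constexpr std::uint32_t kPrimitiveMask = (1u << kPrimitiveBits) - 1u;    // 0x7f

struct VisiblePrimitive {
    std::uint32_t meshletInstanceIndex;
    std::uint32_t primitiveId;
};

std::uint32_t PackVisiblePrimitive(std::uint32_t meshletInstanceIndex,
                                   std::uint32_t primitiveId) {
    assert(primitiveId <= kPrimitiveMask);
    assert(meshletInstanceIndex < (1u << (32u - kPrimitiveBits))); // 25 bits left
    return (meshletInstanceIndex << kPrimitiveBits) | primitiveId;
}

VisiblePrimitive UnpackVisiblePrimitive(std::uint32_t packed) {
    return {packed >> kPrimitiveBits, packed & kPrimitiveMask};
}

// One u32 per vertex (three per triangle) versus one u32 per triangle.
constexpr std::uint64_t PerVertexBufferBytes(std::uint64_t triangles) {
    return triangles * 3u * sizeof(std::uint32_t);
}
constexpr std::uint64_t PerTriangleBufferBytes(std::uint64_t triangles) {
    return triangles * sizeof(std::uint32_t);
}
```

Note the 25 remaining bits cap the scene at about 33.5 million meshlet instances with this particular split.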

primal shadow
#

1gb of data per 250 million triangle instances (excluding asset data)

#

Still a lot, but more manageable than 3gb

wicked notch
#

The next step is just budgeting (and consequently, LoD)

primal shadow
#

Theoretically I need to allocate the whole amount regardless of lods though

wicked notch
#

No, you allocate a fixed budget

primal shadow
#

Unless I'm ok estimating and allocating a lower amount of data on the premise that it won't all be used due to LODs/culling

wicked notch
#

it's just the classic budgeting problem

#

allocate a fixed budget, work with that, hope that culling and LoD don't make you go over budget

primal shadow
#

Mhm

primal shadow
#

I wonder if I can do without uploading per triangle data at all

#

Just upload visible meshlet IDs and do some kind of prefix sum to inform how many triangles there are per meshlet, as a running count

#

And then the vertex shader would do some kind of binary search to find their meshlet id
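
A CPU-side sketch of that idea: an exclusive prefix sum over per-meshlet triangle counts, then a binary search from a flat triangle index back to (visible meshlet slot, local triangle). The names are hypothetical, and a shader version would hand-roll the search since `std::upper_bound` is obviously not available there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// offsets[i] = number of triangles in visible meshlets before slot i;
// offsets.back() = total triangle count.
std::vector<std::uint32_t> BuildTriangleOffsets(
    const std::vector<std::uint32_t>& trianglesPerMeshlet) {
    std::vector<std::uint32_t> offsets(trianglesPerMeshlet.size() + 1, 0);
    for (std::size_t i = 0; i < trianglesPerMeshlet.size(); ++i) {
        offsets[i + 1] = offsets[i] + trianglesPerMeshlet[i];
    }
    return offsets;
}

struct LocatedTriangle {
    std::uint32_t visibleMeshletSlot;
    std::uint32_t localTriangle;
};

// Finds the last slot whose starting offset is <= triangleIndex.
LocatedTriangle LocateTriangle(const std::vector<std::uint32_t>& offsets,
                               std::uint32_t triangleIndex) {
    assert(offsets.size() >= 2 && triangleIndex < offsets.back());
    const auto it = std::upper_bound(offsets.begin(), offsets.end(), triangleIndex);
    const std::uint32_t slot = static_cast<std::uint32_t>(it - offsets.begin()) - 1;
    return {slot, triangleIndex - offsets[slot]};
}
```

This trades the per-triangle buffer for a per-meshlet one at the cost of roughly log2(visible meshlets) loads per vertex invocation.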

wicked notch
#

ye that's also a possibility

wicked notch
#

interestingly, I still fall apart on disjoint meshes

#

I should probably detect those cases and fallback to kdtree partitioning, instead of adding more logic to this

#

this is a pain

#

it's 3am ffs

#

I eep

primal shadow
#

meshlets

#

regular renderer

primal shadow
#

I need to test this on a real scene tbh...

primal shadow
#

I think my occlusion culling is not working 🤔

primal shadow
#

Apparently my meshlet bounding spheres are not correct

primal shadow
#

Fixed occlusion culling

wicked notch
#

fuck it

#

I'll just DFS into the graph to find disconnected-ness
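
Detecting disjoint pieces with a DFS over the adjacency graph might look like this. It is a sketch, assuming the graph is stored in the same two-array CSR form discussed earlier and that adjacency is symmetric; `LabelComponents` is a hypothetical name. The stack is explicit so large meshes do not overflow the call stack.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Components {
    std::vector<std::uint32_t> label; // component id per vertex
    std::uint32_t count = 0;          // number of connected components
};

Components LabelComponents(const std::vector<std::uint32_t>& offsets,
                           const std::vector<std::uint32_t>& edges) {
    assert(!offsets.empty() && offsets.back() == edges.size());
    constexpr std::uint32_t kUnvisited = 0xFFFFFFFFu;
    const std::uint32_t vertexCount = static_cast<std::uint32_t>(offsets.size() - 1);

    Components result;
    result.label.assign(vertexCount, kUnvisited);
    std::vector<std::uint32_t> stack;

    for (std::uint32_t seed = 0; seed < vertexCount; ++seed) {
        if (result.label[seed] != kUnvisited) {
            continue; // already reached from an earlier seed
        }
        result.label[seed] = result.count;
        stack.push_back(seed);
        while (!stack.empty()) {
            const std::uint32_t vertex = stack.back();
            stack.pop_back();
            for (std::uint32_t e = offsets[vertex]; e < offsets[vertex + 1]; ++e) {
                const std::uint32_t neighbour = edges[e];
                if (result.label[neighbour] == kUnvisited) {
                    result.label[neighbour] = result.count;
                    stack.push_back(neighbour);
                }
            }
        }
        ++result.count;
    }
    return result;
}
```

A `count` above 1 is the signal to fall back to another partitioning strategy for that mesh.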

wicked notch
#

I also realized a fatal flaw in my shader resource table just now bleakekw

primal shadow
#

@wicked notch do you have any idea how to do culling for orthographic projections?

wicked notch
#

the same way you do for perspective projections

primal shadow
#

hrm

wicked notch
#

there is no change in the math if you use the Gribb-Hartmann method to extract frustum planes

#

HZB remains precisely the same
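
The plane extraction really is projection-agnostic: the six planes are sums and differences of rows of the view-projection matrix, whatever kind of projection produced it. A minimal sketch under stated conventions that are assumptions here, not something the chat specifies: column vectors (`clip = M * p`), row-major storage `m[row][col]`, and a clip-space depth range of 0 to w (Vulkan/D3D style). The planes are left unnormalised, so normalise them before doing sphere tests.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // m[row][col]

static Vec4 Add(const Vec4& a, const Vec4& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
static Vec4 Sub(const Vec4& a, const Vec4& b) {
    return {a[0] - b[0], a[1] - b[1], a[2] - b[2], a[3] - b[3]};
}

// Each plane is (a, b, c, d); a point is inside when a*x + b*y + c*z + d >= 0.
std::array<Vec4, 6> ExtractFrustumPlanes(const Mat4& viewProjection) {
    const Vec4& r0 = viewProjection[0];
    const Vec4& r1 = viewProjection[1];
    const Vec4& r2 = viewProjection[2];
    const Vec4& r3 = viewProjection[3];
    return {{
        Add(r3, r0), // left:   -w <= x
        Sub(r3, r0), // right:   x <= w
        Add(r3, r1), // bottom: -w <= y
        Sub(r3, r1), // top:     y <= w
        r2,          // near:    0 <= z (use Add(r3, r2) for a -w..w depth range)
        Sub(r3, r2), // far:     z <= w
    }};
}

bool IsPointInside(const std::array<Vec4, 6>& planes, float x, float y, float z) {
    for (const Vec4& p : planes) {
        if (p[0] * x + p[1] * y + p[2] * z + p[3] < 0.0f) {
            return false;
        }
    }
    return true;
}
```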

primal shadow
#

It was the way I was converting depth 😛

#

Forgot I had to adjust that for ortho

wicked notch
#

I think adding fake connectivity edges with a huge weight should make this epic

wicked notch
#

TODO: look more into voronoi++
also figure out disjoint set union and vertex hash discretization
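
One possible reading of "vertex hash discretization" is welding vertices by snapping positions to a grid and keying a map on the integer cell. The sketch below is that reading only, with hypothetical names (`WeldVertices`, `cellSize`); it uses an ordered `std::map` for brevity where a real implementation would likely hash the cell.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

using Position = std::array<float, 3>;

// Returns remap[i] = index of the first vertex that landed in the same grid
// cell as vertex i. Vertices that sit close together but straddle a cell
// boundary are NOT welded, which is the known weakness of plain quantisation.
std::vector<std::uint32_t> WeldVertices(const std::vector<Position>& positions,
                                        float cellSize) {
    assert(cellSize > 0.0f);
    std::map<std::array<std::int64_t, 3>, std::uint32_t> firstInCell;
    std::vector<std::uint32_t> remap(positions.size());
    for (std::uint32_t i = 0; i < positions.size(); ++i) {
        const Position& p = positions[i];
        const std::array<std::int64_t, 3> cell = {
            static_cast<std::int64_t>(std::floor(p[0] / cellSize)),
            static_cast<std::int64_t>(std::floor(p[1] / cellSize)),
            static_cast<std::int64_t>(std::floor(p[2] / cellSize)),
        };
        const auto [it, inserted] = firstInCell.try_emplace(cell, i);
        (void)inserted;
        remap[i] = it->second;
    }
    return remap;
}
```

Rewriting the index buffer through `remap` gives the clusterizer real shared edges on meshes whose faces were exported with fully unique vertices.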

primal shadow
#

@wicked notch I kind of want to try LODs. What do I need to know?

buoyant summit
wicked notch
#

what an article to wake up to

frank sail
#

that article seems like exactly what you are working on lol

#

now you can copy his homework. how serendipitous

wicked notch
#

he's using meshoptimizer for clusterization, which is what I'm struggling with

#

but the "welding vertices together" is extremely valuable info

#

so I'll be copying that indeed KEKW

wispy spear
wicked notch
#

voronoi is definitely helping me

#

I really gotta thank all the magic math guys

#

it's basically the dual of a Delaunay triangulation

wicked notch
#

I'm also about to commit crimes against unreal engine

#

none of these planes are connected together, all vertices are unique

wispy spear
#

thats how Cities Skylines 2 renders forests, no? 😄

wicked notch
#

ok blender did not understand the assignment

#

I'll do it myself

wicked notch
#

epic

#

it crashes unreal

#

welp

#

it's 100% valid gltf

wicked notch
#

I managed to import it

#

and yep, the real deal fails hard here

wicked notch
wispy spear
#

hehe my engine does not support untringleized stuff

wicked notch
#

btw @frank sail

#

remember the eternally broken HZB?

#

for some reason it's not broken anymore

#

I have no idea why this even works

#

but if during the HzbCopy shader, instead of doing an initial reduction, you just take the sample and shove it into level 0 and then reduce from there

#

you get no flickering or artifacting of any kind

frank sail
#

the gnomes invaded your PC and fixed it

wicked notch
#

can you try it out in frogfood as well, just to check whether I'm going crazy or not

#

because it actually works

#

I don't know why it actually works

frank sail
#

after you wake up

wicked notch
frank sail
#

sounds like 1-2 lines

wicked notch
#

oh yeah

frank sail
#

originally we took the nearest 4 texels or so

#

but it sounds like you want to do something different

wicked notch
#

it's just this
```glsl
void main() {
    const ivec2 sourceSize = textureSize(g_inputDepthImage, 0);
    const ivec2 targetSize = imageSize(g_hzbMainImage);
    if (any(greaterThanEqual(gl_GlobalInvocationID.xy, targetSize))) {
        return;
    }
    const vec2 ratio = vec2(sourceSize) / vec2(targetSize);
    const ivec2 sourcePosition = ivec2(vec2(gl_GlobalInvocationID.xy) * ratio + 0.5);
    const ivec2 destPosition = ivec2(gl_GlobalInvocationID.xy);

    const float depth = GetSample(g_inputDepthImage, sourcePosition);
    imageStore(g_hzbMainImage, destPosition, vec4(depth));
}
```

#

GetSample has an additional safeguard where I do min(position, textureSize(depth, 0) - 1)

frank sail
#

aight

#

how could that possibly be correct doe bleakekw

wicked notch
#

I have zero clue

#

this should not work

frank sail
#

wait what does GetSample do exactly

wicked notch
#

and yet it does

frank sail
#

is it just texelFetch

wicked notch
frank sail
#

ok yeah

#

how tf

wicked notch
#

I do not know

frank sail
#

my guess is that your code subtly fails still

wicked notch
#

yes

#

I did not observe this

frank sail
#

I can't test rn because I'm deep in this uncommitted code (and need to eep soon)

wicked notch
#

I tested on the classic intel sponza long tile and it works fine

#

which scene was used the last time we observed HZB failure?

frank sail
#

uh I think khronos sponza

#

that one had the worst artifacts for me

wicked notch
#

after observing extensively for 5 minutes, I cannot see artifacts

#

it slightly underculls though

frank sail
#

but it does cull something

wicked notch
#

it does cull a lot yes

frank sail
#

when you say it underculls, does that mean it's bugged

wicked notch
#

no it culls every smol meshlet inside

#

the camera is outside of sponza

frank sail
#

ah so this is just not culling the yuge meshlets

#

which is expected

wicked notch
#

works fine here too

#

this is at 1920x1080 with a 1024^2 HZB

#

it makes zero sense

wispy spear
#

driber update perhaps in between which fixed something?

wicked notch
#

I've had geforce experience that is nagging me to update my drivers for a while now KEKW

wispy spear
#

or jaker made someone add IrisVk.exe to "the list"

wicked notch
#

ShooterGame.exe

faint crane
#

We experienced some solar radiation storms up north which may have flipped the bit you needed to fix Hi-Z culling.

wicked notch
#

to avoid clogging the #engine-dev channel

#

we have unity

#

it is now time to reverse engineer the shit out of this engine

wispy spear
#

does it have hzb and vsm? 😄

wicked notch
#

maybe it does have hzb, but surely not VSM

#

that means it's 1-0 already for GP

wispy spear
#

hehe

#

i also saw this popping up in my recommendations https://www.youtube.com/watch?v=w99UcsgkUgE

This video is the first in a series of two lectures given by Keenan Crane at the Harvard FRG Workshop on Geometric Methods for Analyzing Discrete Shapes: https://cmsa.fas.harvard.edu/frg-2021/

Day II: https://www.youtube.com/watch?v=JQ2burHX710

Abstract: The intrinsic viewpoint was a hallmark of 19th century geometry, enabling one to reason ab...

wicked notch
#

oh nice

#

bookmarked

wispy spear
#

seems to be a multipart-er thing

wicked notch
#

oh wow, unity has shit default shadows

#

peter panning mmm yes

#

oh damn unity uses dx11 by default KEKW

wheat haven
#

it feels like dx11 is still the default for most things these days

wicked notch
#

uh

#

is unity shit by default or am I just too inexperienced with unity

wispy spear
#

both perchance

#

im sorry, i couldnt resist

wicked notch
wicked notch
#

besides the plane being 200 triangles of allah

#

they don't do any indirect shenanigans, meshletisms, HZBs

wheat haven
#

yeah I see z-prepass, CSM, rendering, postfx

#

and one other pass which might be bloom? I can't tell

wicked notch
#

it's SSS

wheat haven
#

ah

wicked notch
#

of all things KEKW

#

wait hold up

#

I need to use HDRP if I want the good™ stuff