#Iris - A Journey through OpenGL and beyond to learn Graphics
oh this is in a shader 
Yes, I wanted to do this basically:
```glsl
layout (buffer_reference) buffer BDA {};
void main() {
    const BDA ptr = BDA(address);
}
```
ye saw your #vulkan post 😄
I just assumed C++
I think what you can do is make another struct (or buffer declaration) where all the members are const, then cast the address to that
honestly quite incredible
can you put the readonly qualifier on the buffer declaration
Actually yeah
That might do the tricc
But there is another problem
Wait, there is no problem
I am shrimply dum
I forgor GL_EXT_shader_explicit_arithmetic_types
[validation] Validation Error: [ UNASSIGNED-Device address out of bounds ] Object 0: handle = 0x26427633360, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x1a898625 | Device address 0x111d0030 access out of bounds. Command buffer (0x26432dba6a0). Draw Index 0. Pipeline (0x89e60f0000000042). Shader Module (0x5c5283000000003e). Shader Instruction Index = 226. Stage = Vertex. Vertex Index = 2 Instance Index = 0. Shader validation error occurred in file ../shaders/0.1/main.vert at line 43.Unable to find suitable #line directive in SPIR-V OpSource.
Validation truly is, broken (as in too powerful)
How do you even detect this 
noice
I have no idea how I've lived without it for so long
VK_EXT_mesh_shader has been conquered
And it is faster than anything I have ever written in OpenGL somefuckinghow
HOW is this faster than my microoptimized, single drawcall, indexed meshlet emulator
This doesn't make any friggin sense 
This doesn't even have vertex quantization, or anything at all really
It's just bruteforce meshlets and it's somehow faster..
All the time I spent optimizing the emulation in GL 🥲
After a very slight optimization, still no quantization, this is what it looks like
500 microseconds to render bistro (no culling at all) 
for comparison against vertex shader I am getting 740 microseconds on a RX 5700 XT, also without any culling
Very nice
rtx 3070 is a more performant card so some difference definitely comes just from that
you render exterior only, yes?
No actually, interior as well
what the. The new amd drivers dont have GL_ARB_texture_compression_bptc anymore.
(Need to have some compressed internal format, otherwise interior fills vram completely)
another bug ticket it is
yes I consider that a bug
rip
Wait
My Vulkan thing has no textures
holy shit 
I completely forgot about implementing textures
Good thing bptc is core now
i am dumb the extension string I was checking for had a typo
forgot that
I am stupid, with traditional rendering is what gets written to the depth buffer just gl_Position.z after w division?
and after the viewport transform, I believe
but I've never used manual depth range so idk
This should work in Vulkan too right?
Yes
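For reference, a minimal C++ sketch of the depth value discussed above, assuming the Vulkan convention where NDC z already lies in [0, 1] and the viewport maps it with `minDepth`/`maxDepth`. The names `Clip`, `DepthRange` and `window_depth` are made up for illustration.

```cpp
// Clip-space position, i.e. what the shader writes to gl_Position.
struct Clip { float x, y, z, w; };

// Mirrors VkViewport::minDepth / VkViewport::maxDepth.
struct DepthRange { float min_depth; float max_depth; };

// Depth that ends up in the depth buffer: perspective divide first,
// then the viewport depth-range transform.
inline float window_depth(const Clip& clip, const DepthRange& range = DepthRange{0.0f, 1.0f}) {
    const float z_ndc = clip.z / clip.w;
    return range.min_depth + z_ndc * (range.max_depth - range.min_depth);
}
```

With the default range this is just z / w. Default OpenGL differs: NDC z is in [-1, 1] there, so the depth range transform also rescales it into [0, 1].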
I have huge respect for Unreal Engine devs
I already had huge respect but now that I'm trying to do what they do
It's incredibly painful 
nsight does not support R64_UNORM textures
The dudes over at epic really just did Nanite without debuggers 
or has tools which can handle it
or early versions of nsight etc, im sure there is some cooperation happening
Likely yeah, they probably have their own debuggers and tools tbh
that should encourage you to make one yourself too : >
I am only one human being
but a smart one
It's a numbers problem 
doesnt make you less schmart 🙂
Ladies and gents
We got the thing
we have depth, meshlet ID and primitive ID inside a 64bit framebuffer
hybrid software/hardware rasterization coming soon™️
```glsl
#version 460
#extension GL_ARB_separate_shader_objects : enable
#extension GL_EXT_shader_explicit_arithmetic_types : enable
#extension GL_EXT_shader_image_int64 : enable
#extension GL_EXT_mesh_shader : enable

layout (location = 0) in i_vertex_data_block {
    flat uint i_meshlet_id;
};

layout (r64ui, set = 0, binding = 1) uniform u64image2D u_visbuffer;

void main() {
    // depth in [0, 1] never sets the top two bits of its float encoding, so 30 bits are enough
    const uint64_t depth = uint64_t(floatBitsToUint(gl_FragCoord.z) & 0x3fffffffu);
    // layout: [63:34] depth | [33:7] meshlet ID | [6:0] primitive ID
    const uint64_t payload =
        (depth << 34) |
        // mask before shifting, otherwise only 20 bits of the meshlet ID survive
        ((uint64_t(i_meshlet_id) & 0x07ffffffu) << 7) |
        (uint64_t(gl_PrimitiveID) & 0x7fu);
    imageAtomicMax(u_visbuffer, ivec2(gl_FragCoord.xy), payload);
}
```
world's weirdest fragment shader
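A CPU-side sketch of the same 64-bit packing may help when decoding the visbuffer later. It assumes the layout 30 bits depth, 27 bits meshlet ID, 7 bits primitive ID, and that a larger depth value means closer (reverse Z), which is what makes a plain max act as the depth test. All names here are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t PRIMITIVE_BITS = 7;
constexpr std::uint64_t MESHLET_BITS = 27;
constexpr std::uint64_t DEPTH_SHIFT = PRIMITIVE_BITS + MESHLET_BITS; // 34

// Equivalent of GLSL floatBitsToUint.
inline std::uint32_t float_bits(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Depth goes in the top bits so that comparing texels compares depth first.
inline std::uint64_t pack_visbuffer(float depth, std::uint32_t meshlet_id, std::uint32_t primitive_id) {
    const std::uint64_t depth_bits = float_bits(depth) & 0x3fffffffu;
    const std::uint64_t meshlet_bits = meshlet_id & ((1u << MESHLET_BITS) - 1u);
    const std::uint64_t primitive_bits = primitive_id & ((1u << PRIMITIVE_BITS) - 1u);
    return (depth_bits << DEPTH_SHIFT) | (meshlet_bits << PRIMITIVE_BITS) | primitive_bits;
}

inline std::uint32_t unpack_meshlet_id(std::uint64_t texel) {
    return std::uint32_t(texel >> PRIMITIVE_BITS) & ((1u << MESHLET_BITS) - 1u);
}

inline std::uint32_t unpack_primitive_id(std::uint64_t texel) {
    return std::uint32_t(texel) & ((1u << PRIMITIVE_BITS) - 1u);
}

// What imageAtomicMax does per pixel, minus the atomicity.
inline std::uint64_t resolve(std::uint64_t stored, std::uint64_t incoming) {
    return std::max(stored, incoming);
}
```

With 27 bits the meshlet ID tops out at 134217727, so very large instanced scenes can run out of IDs.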
Uhh
I think position is kinda broken 
```glsl
vec3 unproject_depth(in float depth, in vec2 uv) {
    const vec4 ndc = vec4(uv * 2.0 - 1.0, depth, 1.0);
    const vec4 world = inverse(u_camera.data.pv) * ndc;
    return world.xyz / world.w;
}
const vec3 position = unproject_depth(depth, gl_FragCoord.xy / vec2(resolution));
```
Am I stupid or is this fine?
anyways, looks okay
If you invert PV and transform NDC you get M
world_with_funny_w
looks cool though
I fixed
somehow gl_FragCoord.xy / resolution is different than getting uvs from vertex shader
barythingy vs perspective perhaps?
Probable
good progress either way 🙂
We have le normals
I can still do analytical partial derivatives let's goo
I don't understand the need to rescale the derivatives though 
perhaps doing the math by hand could be beneficial
Is EmitMeshTasksEXT just a fancy vkCmdDispatch except it's in a task shader? 
Yes
Hmm
Too late, I sleep now. Gn 
Gn sir
By the time you will awoke from your slumber, meshlet culling shall be fully functional
Hmm, I don't understand glsl subgroupBallotExclusiveBitCount
gl_SubgroupSize could be 64 in case of AMD, so how does this return uint? 
Does subgroupBallotExclusiveBitCount perhaps count bits up until gl_SubgroupInvocationID (excluded)?
wdym
subgroupBallotExclusiveBitCount returns uint, that's 32 bits, not enough to hold all subgroup ballots (for wave64's)
uint subgroupBallotExclusiveBitCount(uvec4 value) returns the exclusive scan of the number of bits set in value, only counting the bottom gl_SubgroupSize bits (we'll cover what an exclusive scan is later).
number of bits

Makes sense then
only counting the bottom gl_SubgroupSize bits is an extremely convoluted and cryptic way of saying "we only count bits up until the current subgroup's ballot excluded"
Unless that's not actually what it's saying
It would make sense though
Since you can do stuff[base + subgroupBallotExclusiveBitCount(vote)] = other_stuff
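To make the exclusive bit count concrete, here is a small CPU emulation. It is only a sketch of the semantics as I understand them: the ballot is a 128-bit mask (a `uvec4`), and each lane counts the set bits strictly below its own index. The result is a count, at most the subgroup size, which is why a single `uint` is enough even for wave64. `Ballot`, `make_ballot`, `exclusive_bit_count` and `compact` are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// A GLSL ballot is a uvec4: 128 bits, one per lane.
using Ballot = std::array<std::uint32_t, 4>;

inline Ballot make_ballot(const std::vector<bool>& votes) {
    Ballot ballot = {};
    for (std::size_t lane = 0; lane < votes.size() && lane < 128; ++lane) {
        if (votes[lane]) ballot[lane / 32] |= 1u << (lane % 32);
    }
    return ballot;
}

// Number of set bits strictly below `lane` (the lane's own bit is excluded).
inline std::uint32_t exclusive_bit_count(const Ballot& ballot, std::uint32_t lane) {
    std::uint32_t count = 0;
    for (std::uint32_t i = 0; i < lane; ++i) {
        count += (ballot[i / 32] >> (i % 32)) & 1u;
    }
    return count;
}

// Stream compaction: every lane that voted true writes its index to a unique, dense slot.
inline std::vector<std::uint32_t> compact(const std::vector<bool>& votes) {
    const Ballot ballot = make_ballot(votes);
    const std::uint32_t lanes = std::uint32_t(std::min<std::size_t>(votes.size(), 128));
    std::vector<std::uint32_t> survivors(exclusive_bit_count(ballot, lanes));
    for (std::uint32_t lane = 0; lane < lanes; ++lane) {
        if (votes[lane]) survivors[exclusive_bit_count(ballot, lane)] = lane;
    }
    return survivors;
}
```

Note that only lanes that voted true write. A lane that voted false has the same exclusive count as the next surviving lane, so letting it write would clobber that slot.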
```glsl
layout (local_size_x = WORKGROUP_SIZE, local_size_y = 1, local_size_z = 1) in;

layout (push_constant) uniform pc_data_block {
    uint64_t meshlet_address;
    uint64_t vertex_address;
    uint64_t index_address;
    uint64_t primitive_address;
    uint64_t transforms_address;
    uint meshlet_count;
};

taskPayloadSharedEXT task_payload_t payload;

void main() {
    const uint meshlet_id = gl_GlobalInvocationID.x;
    const bool is_visible = meshlet_id < meshlet_count /*&& frustum_cull(meshlet_id)*/;
    // a ballot only spans one subgroup, so this relies on WORKGROUP_SIZE == gl_SubgroupSize
    const uvec4 vote = subgroupBallot(is_visible);
    const uint surviving = subgroupBallotBitCount(vote);
    const uint offset_index = subgroupBallotExclusiveBitCount(vote);
    payload.base_meshlet_id = gl_WorkGroupID.x * WORKGROUP_SIZE;
    // only survivors own a slot; a culled invocation would race with the next survivor
    if (is_visible) {
        payload.meshlet_offset[offset_index] = uint8_t(gl_LocalInvocationID.x);
    }
    // called by every invocation (uniform control flow); arguments come from the first one
    EmitMeshTasksEXT(surviving, 1, 1);
}
```
World's dumbest task shader
holy shit it works
Mfw it's easier to write task shaders than to do frustum culling with infinite/reverse Z projections
I am dumb and stupid
I always forget to do prim_count * 3 instead of just prim_count
ffs
Turns out the projection is fine
My plane extraction method also works fine (I think)
It took a long while
But we did it
world's most efficient rasterizer
But we do not stop here
We can be efficienter
is this the page where you'll be documenting nanite impl progress?
Yes
It's mostly memes + me ranting about stuff I don't like though 
Notice that I use the easy way out, i.e I use mesh shaders
Nanite emulates them, I do have an OpenGL prototype for mesh shader emulation, but it's a pain to work with
Awesome gonna follow this
Do they do that so they can support apis that don’t have mesh shaders?
Yeah
You could (in theory) run nanite on 10 year old GPUs if you really wanted to 
The minimum requirement is just 64 bit buffer atomics
Alright next on the list are
- HiZ occlusion culling
- Primitive culling
- Cluster screenspace area classification
And hopefully I can begin with compute rasterization soon™️
Hmm, gltfpack doesn't seem to be able to generate instances on its own
I mean it makes sense that one model = one instance, but eh
cgltf chokes on EXT_mesh_gpu_instancing 
I have been pondering
```cpp
struct meshlet_glsl_t {
    uint32 vertex_offset = 0;
    uint32 index_offset = 0;
    uint32 primitive_offset = 0;
    uint32 index_count = 0;
    uint32 primitive_count = 0;
    uint32 group_id = 0;
    alignas(alignof(float32)) aabb_t aabb = {};
};
```
Given that my meshlet struct is about 48 bytes
Soon to become just 40, replacing the AABB with the sphere
Say a mesh subdivides in N meshlets and this mesh has M instances
Does it make sense to have N * M meshlets?
There will be a ton of redundancy...
do you predict that this change will have a big positive impact on $PERFORMANCE?
I don't care about that yet
I am just brainstorming how to do instanced rendering with meshlets
But it will have a huge impact on memory
Right now I'm uploading M times the same vertices to the GPU, over and over again 
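One way to picture the trade-off, as a rough sketch: keep the N meshlets once and add a small per-instance record that points at a meshlet and a transform. The `meshlet_instance_t` layout and the byte counts below are assumptions for illustration (48 bytes per meshlet as mentioned above, 8 bytes per record), not the actual IrisVk layout.

```cpp
#include <cstdint>

// Hypothetical indirection record: which meshlet, drawn with which instance transform.
struct meshlet_instance_t {
    std::uint32_t meshlet_id;
    std::uint32_t instance_id;
};

constexpr std::uint64_t MESHLET_SIZE = 48; // bytes, from the struct above

// N meshlets duplicated for each of the M instances.
constexpr std::uint64_t duplicated_bytes(std::uint64_t meshlets, std::uint64_t instances) {
    return meshlets * instances * MESHLET_SIZE;
}

// N meshlets stored once, plus N * M small records.
constexpr std::uint64_t indirected_bytes(std::uint64_t meshlets, std::uint64_t instances) {
    return meshlets * MESHLET_SIZE + meshlets * instances * sizeof(meshlet_instance_t);
}
```

For 1000 meshlets and 100 instances that is 4,800,000 bytes duplicated versus 848,000 bytes with the indirection, and the vertex data is uploaded only once either way.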
hmm
#graphics-techniques message
pinned it too, jaker talks so much it would just go under again
Thanks m8
Instances!
Hmm my instancing is borking meshoptimizer somehow
It can't generate a proper vertex remap for bistro only
Perhaps I am breaking some assumption meshopt makes
Just now I realize how small a number 134217728 is 
With big scenes the number of meshlets is insane
Powerplant alone is 12 million (instanced) meshlets
oof
I have now reduced the number of meshlets considerably
at the cost of more memory
for fucks sake
Why can't I have 1 terabyte of VRAM
Everything would be so much easier
I was debugging why displaying normals would cause a device lost
Even though displaying meshlet ID would work just fine
Turns out I was passing 0 as the meshlet instance buffer address
How the hell was it working before
???
Alright now powerplant is 200628 meshlets 
Does performance improve to a point with more meshlets or degrade? Is it a balancing act to keep the number within a good range?
Of course performance scales pretty much linearly with the number of meshlets, but occupancy stays the same
Whether you send 100 meshlets or 100000 meshlets the GPU will happily process them at full speed
Perhaps it would be a cool experiment to merge ALL meshes into a single huge mesh and derive meshlets from that
powerplant's primitives are also quite awkward
all the pipes are one, iirc for example
Hmm
I currently do depth testing with a shrimple imageAtomicMax
Actually, nevermind whatever I was thinking 
I really don't like HiZ
It's overly conservative at times, you have to handle disocclusion events, frame 0 is a special case
you could play Rust or PubG instead

didnt you ponder around the other day, that hiz doesnt do anything for your already good meshlet renderisms?
or is that a different thing
Perhaps I misspoke, occlusion culling is definitely useful
I just have a personal feud with HiZ 
The impl is also straight from Niagara so..
Anyways, I don't think it's practical to do ROC with meshlets, far too many AABBs lol
The culling would not be worth it I think, perhaps I could experiment
It wouldn't integrate very well with the TASK/MESH pipeline but eh
It's just one more buffer, what's wrong with that
What do you guys predict, will ROC be worth it?
Bets are open
republic of china
yes
ah
Tbh I kinda want to just leave HiZ as is and come back to it later
I would like to move onto the next step, which is cluster area/error estimation
isnt roc "just"(tm)
's new cuda?
And software rasterization for the big gains
That's ROCm 
Yes, ROC is raster occlusion culling
oops : > you got me
there are exercises for that
Which means I am on my own 
perhaps pester him to figure something out with you
you are the only one actually trying to achieve something here
@glass sphinx lustri is complaining that you did not do cluster area estiminiation yet
I am not
i only touch opengl with a stick
we can plan it together
Worry not, I am currently using our lord and savior 
goooood
so
did you ever want to join a cult before?
i have good news
we are recruiting

first you need to sign a clause that makes you my property forever
i am gathering the convincement crew
It's a good and friendly cult
you only need to recruit 5 more ppl as a payoff
for getting introduced
it's worth it tho
we can make it 3 if you are cool
which you seem you are
are you a c++ person @wicked notch ?
Did somebody say Daxa?
They're raiding me 
deccer look what you did
we come and conquer
Anyways, I'm honored about this invitation, but I'm afraid I can't join your cult 😦
👹

anyways we need more daxa people to ~~compensate my lazyness~~ get even more features in
if you ever feel the need to completely rewrite everything for no reason in daxa we are ~~eating you alive~~ always here
Sure, I will eventually get tired of writing stuff on my own 
merge it into daxa
Suffering is meant to be shared
especially when it's caused by GP
Truer words have never been spoketh
btw is there a way to see the post quickly?
it seems like i must scroll up throu 100 quadrillion messages
Ah yeah, unfortunately I did not have the foresight to pin the initial thing 
But worry not, I just recently started with Vulkan
I went straight for mesh shaders (the EXT one)
template addict very good
Holy shit Patrick relax on the cult behavior
nice
discord is so shit
how do they miss obvious features like that i cant see the original post

before you fine daxa people try to sell more carpet to my italic friend here, help him figure out cluster-mesh-thingy
then he's all yours 😄
Look at him, abandoning me like this
i get a commission, its worf it
Lmao
This smells of being useful, they don't teach that at Daxa academy 😦
crimes
@wicked notch can you link and pin the github?
search for a common keyword and sort by old
discord is very cool
OpenGL garbage: https://github.com/LVSTRI/Iris
Vulkan garbage: https://github.com/LVSTRI/IrisVk
I suppose you won't be much interested in the GL one
soon daxa ~~garbage~~ pure gold
its not garbage, its brainworm material tbf
Daxa is not garbage
Daxa is literally the best vulkan abstraction possible
Except it's missing a couple things 💀
no sales pitches we need solutions
that is where lvstri comes in
i like your code
Thanks, it's missing a lot of niceties though
The code looks awesome
A proper render graph for example 
damn this code is super clean
I mean, thats a pretty tall task for a nicety 😄
task list burned oput so much of by brain
since task list i cant type properly anymore
same except for me it was after filling in sTypes all day
we had pain bugs caused by omitting them by accident in some places
😄
I don't understand how Darianopolis managed to export FBX in unreal tbh
I'm trying it with Moana assets and it exports a botched version, the lowest LOD possible 
THERE ARE FOUUUUURR LIGHTS
at last
TNG is really good
@wispy spear pro tip: of the new Star Treks (which are all poopy), the newest one (Strange New Worlds) is really good
highly recommended
Pike is wonderful
@glass sphinx you're late
already watched all of it
and yes, you're right
what I find pretty crap is Spock and the sister, I thought his wife was way cooler
and the Noonien-Singh lady also gets on my nerves every time with her Gorn shit
a pity the Andorian engine room guy is gone, he was the best
das a jacky chan movie
we're just lamenting about the latest star trek isms, where strange new worlds is the better out of the 3 new shows
Star trek things aside
I think I have the best possible implementation of HiZ my brain can manage
And it is reasonably conservative
A little demo
FSR2 will come soon™️
Also is it just me or are these normals fucked
This is the first I've seen this 
hm
he really was the best
super sad
but he had very good episodes
if you don't remap your normals, half of them will be black
Yeah, I meant the green being -X
o
On the little thingy in the middle
n * 0.5 + 0.5
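A tiny sketch of that remap, in case it helps: unit normals have components in [-1, 1], and `n * 0.5 + 0.5` moves them into the displayable [0, 1] range so negative components stop clamping to black. `Vec3`, `encode_normal` and `decode_normal` are illustrative names.

```cpp
#include <array>

using Vec3 = std::array<float, 3>;

// [-1, 1] -> [0, 1], for visualizing or storing normals as colors.
inline Vec3 encode_normal(const Vec3& n) {
    return Vec3{n[0] * 0.5f + 0.5f, n[1] * 0.5f + 0.5f, n[2] * 0.5f + 0.5f};
}

// [0, 1] -> [-1, 1], the inverse mapping.
inline Vec3 decode_normal(const Vec3& c) {
    return Vec3{c[0] * 2.0f - 1.0f, c[1] * 2.0f - 1.0f, c[2] * 2.0f - 1.0f};
}
```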
ye this do not be good
huuh
bh/2 * numberOfTris
Course it is
The formula to calculate the area of a regular polygon is, Area = (number of sides × length of one side × apothem)/2, where the value of apothem can be calculated using the formula, Apothem = [(length of one side)/{2 ×(tan(180/number of sides))}].
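The quoted formula, written out as a small sketch (angles in radians, so 180 degrees becomes pi). `regular_polygon_area` is an illustrative name.

```cpp
#include <cmath>

// Area = (sides * side_length * apothem) / 2,
// with apothem = side_length / (2 * tan(pi / sides)).
inline double regular_polygon_area(unsigned sides, double side_length) {
    const double pi = std::acos(-1.0);
    const double apothem = side_length / (2.0 * std::tan(pi / sides));
    return sides * side_length * apothem / 2.0;
}
```

A square with side 2 has apothem 1 and area 4, which matches the usual side squared.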
Das a lot of data
Well not a lot
But hm
I wonder how Unreal does it, I was thinking of computing the area of a clip space projected AABB around the cluster
So like, (box.z - box.x) * (box.w - box.y)
But this is extremely skewed towards hardware rendering
And hardware rendering is cringe
I should also determine clusters whose triangles are going to be clipped
Given that we do things per vertex, I could check if vertex.xy > 1.0 || vertex.xy < -1.0
And mark the cluster as hardware rasterizeable if so
Cutting the area of the projected AABB in half could be good
Depending on uh, literally everything
Jaker, could you lend me your braincell
I can lend a froge a braincell
ok so you're trying to see how big a cluster is in screen space as a heuristic to determine which rasterizer to use?
yes
did you already give up on getting the bounding box
Later on to determine which lod to use, but that's a story for future me
I have not
It's the only way I could think of 
Of course I will try it, I would like to hear other smorter/dumbere ways
das ist cool btw, just saw the motion picture in all normal glory
I still don't have textures btw 
soon(tm)
I won't introduce a memory bottleneck immediately so that I can still see gains from my ridiculous quest towards Nanite
Or well, a scuffed, dumber and worse version of Nanite 
Inigo Quilez to the rescue pog
I think it might be more ideal than the screen space AABB in some instances
idk if it's better in general
i recently got 25tb/s vramthroughput
profilers do be sniffing glue sometimes
ultra wrong cluster area detection 
At least clip detection is working (no white pixels at the edges)
Alright we did it boys
Blue = Small Area = Software Raster
Red = Big Area = Hardware Raster
Black = Clipped = Hardware Raster
Everything converges to blue as distance grows as expected
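A rough sketch of the classification described above, under a few assumptions: vertices are already projected to NDC, the area is the screen-space bounding box in pixels, and any vertex outside [-1, 1] in x or y marks the cluster as clipped. The threshold `max_sw_area` is a made-up tuning parameter, and z or negative-w clipping is ignored here.

```cpp
#include <algorithm>
#include <vector>

enum class ClusterClass { SmallArea, BigArea, Clipped }; // blue, red, black

struct Ndc { float x, y; };

inline ClusterClass classify_cluster(const std::vector<Ndc>& vertices,
                                     float width, float height, float max_sw_area) {
    if (vertices.empty()) return ClusterClass::SmallArea;
    float min_x = 1.0f, min_y = 1.0f, max_x = -1.0f, max_y = -1.0f;
    for (const Ndc& v : vertices) {
        // Anything poking outside the viewport would need clipping: leave it to hardware.
        if (v.x < -1.0f || v.x > 1.0f || v.y < -1.0f || v.y > 1.0f) {
            return ClusterClass::Clipped;
        }
        min_x = std::min(min_x, v.x);
        min_y = std::min(min_y, v.y);
        max_x = std::max(max_x, v.x);
        max_y = std::max(max_y, v.y);
    }
    // NDC spans 2 units across the viewport, hence the 0.5 factors.
    const float area = (max_x - min_x) * 0.5f * width * (max_y - min_y) * 0.5f * height;
    return area > max_sw_area ? ClusterClass::BigArea : ClusterClass::SmallArea;
}

// Only small, unclipped clusters go to the software rasterizer.
inline bool uses_hardware_raster(ClusterClass c) {
    return c != ClusterClass::SmallArea;
}
```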
Now I gotta do soft rast 
damn this thing is getting better and better ❤️
Is this still opengl ?
Or have you wandered off to vulkan ?
This stuff looks soo cool
Fair enough
I also recently started learning vulkan
And the validation layers are so much better than the debug call backs
Well uh
I have IrisGL that works pretty well, it has shadows and all, so I can use that thing as a reference sometimes
Don't slander my boy GL 😭
Yes, the last updates have been in Vulkan
LVSTRI one of the people i would hire if i could
workoholics that write pretty code are very good for business
Switched everything to Vulkan or doing hybrid with gl?
wouldnt surprise me if lustri interops some dx12 into the mix for some obscure reason heh
I think it's finally time to add textures
Man, 128GB of RAM feels truly liberating
Unreal uses up to 64 and I still have 64 left 
with a super long delay, textures (meshlet flavor)
Hey nice!
What causes it to need so much system ram? Are you having to stream from system to vram a lot or does it mostly stay in system?
Not yet, I got so much RAM because of Unreal Engine (which I use as editor) and blender 
They were taking up so much RAM I couldn't bear it anymore
Are you running both in parallel
three extra chrome tabs is really nice isn't it?
No but blender sucks up a lot of RAM 
converting Bistro from FBX to GLTF was my breaking point to go from 16GB to 32GB 
I went from 32 to 64 because I was often swapping because of Painter...
(We like RAM and VRAM a lot)
Substance Painter?
3D software just eats up all ram and vram lol
well I'll see if my GPU has enough for the stuff I want to do 
Hmm some clusters have triangles scattered about all over the place
The hell is meshoptimizer doing
It is now time
compute rasterizer
except I am sleep
So that is deferred to tomorrow 

Hmm
How should I schedule work for my compute rasterizer
Perhaps one workgroup per meshlet and one thread per primitive?
It's gonna have a lot of dead invocations...
primitive_count is never going to be MAX_PRIMITIVES
More indirection could solve this though 
World's most absolutely unhinged and stupid software rasterizer 
polygon fill: todo
Hmm this isn't very promising
Granted this is probably the most inefficient way possible of doing a compute rasterizer
But right now it's also not doing much...
Perhaps I am spending a lot of time idle
Uhhhh
How the hell do I read this?
Oh well, I can see that the only two if statements are taking up a combined 120% of the time spent in the shader 
I think the added indirection is necessary
I am spending over 300 microseconds idle
is there a way to get rid of that if by somehow sorting the data beforehand?
Yeah
I need to do an extra processing step
I should make a buffer with all software rasterized meshlets and another with an index to all software rasterized primitives
Actually no
it's probably better if I make a single buffer with primitiveID | meshletID << 7
WG size will be 256
And we round down as usual
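A CPU sketch of what such a packed work list could look like, assuming 7 bits for the primitive index and the rest of a 32-bit word for the meshlet index (which leaves 25 bits for meshlets, fewer than the visbuffer's 27). On the GPU this would presumably be built by a compute pass with an atomic counter; the plain loop and the names `MeshletInfo` and `build_sw_primitive_list` are only for illustration.

```cpp
#include <cstdint>
#include <vector>

struct MeshletInfo {
    std::uint32_t primitive_count; // at most 128 so the index fits in 7 bits
    bool software_raster;          // result of the cluster classification
};

// One entry per primitive that should go through the software rasterizer:
// primitiveID | meshletID << 7. Each thread of the raster pass then takes one entry,
// so no thread idles on a meshlet that is smaller than the workgroup.
inline std::vector<std::uint32_t> build_sw_primitive_list(const std::vector<MeshletInfo>& meshlets) {
    std::vector<std::uint32_t> list;
    for (std::uint32_t meshlet_id = 0; meshlet_id < meshlets.size(); ++meshlet_id) {
        if (!meshlets[meshlet_id].software_raster) continue;
        for (std::uint32_t primitive_id = 0; primitive_id < meshlets[meshlet_id].primitive_count; ++primitive_id) {
            list.push_back(primitive_id | (meshlet_id << 7));
        }
    }
    return list;
}
```

Decoding is `entry & 0x7f` for the primitive and `entry >> 7` for the meshlet.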
looping over all primitives vs looping over all meshlets hmmmmmm
deep ponderation
Looks like nanite does the former
And it is somehow faster
Even deeper pondering
I am dispatching 476394 workgroups, each workgroup has a local size of 128
The average number of primitives is about 64
Which means that of 60978432 threads, 30489216 are doing nothing 
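The arithmetic behind those numbers, as a quick sketch (the struct and function names are illustrative):

```cpp
#include <cstdint>

struct DispatchStats {
    std::uint64_t total_threads;
    std::uint64_t idle_threads;
};

// One workgroup per meshlet, one thread per primitive: every thread whose index is
// past the meshlet's primitive count has nothing to do.
constexpr DispatchStats dispatch_stats(std::uint64_t workgroups,
                                       std::uint64_t local_size,
                                       std::uint64_t average_primitives) {
    return DispatchStats{workgroups * local_size,
                         workgroups * (local_size - average_primitives)};
}
```

`dispatch_stats(476394, 128, 64)` gives 60978432 threads in total, 30489216 of them idle, so half the dispatch.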
fouf, sounds like a lot wasted potential
with meshlets you generally want to try to merge as much as possible
otherwise you get low occupancy
Yeah but I just realized with this solution I am shrimply moving the problem elsewhere 
Perhaps making smaller meshlets is better
As a quick sanity check I tried making the meshlets smaller, but the time taken by the rasterizer is the same....

Any change I make impacts minimally, looks like idle threads are not the bottleneck?
I strongly believe I am doing something wrong, I will inspect more closely
The barriers and loads are taking up most of the time, hmm
@frank sail could you help a smol-brained frog in need?
Can you extract any useful information from this
This is with hardware raster (consider only meshlet_cull_and_draw)
what part of the frame am I looking at
meshlet_cull_and_draw (I have disabled culling btw, for testing hw/sw)
This is software
Occupancy is good both on HW raster and SW raster
Thread coherency is also 99%
The software rasterizer also doesn't do primitive filling, it just renders points
I could show you the code if you are interested
But it's really basic
```glsl
shared meshlet_data_t meshlet;
shared vec3[64] vertices;

void main() {
    const uint meshlet_instance_id = gl_WorkGroupID.x;
    const uint meshlet_id = meshlet_instances[meshlet_instance_id].meshlet_id;
    const uint instance_id = meshlet_instances[meshlet_instance_id].instance_id;
    if (is_candidate_sw_raster(meshlet_id) && gl_LocalInvocationID.x == 0) {
        meshlet = meshlet_data[meshlet_id];
    }
    barrier();
    if (!is_candidate_sw_raster(meshlet_id)) {
        return;
    }
    const mat4 transform = instances[instance_id];
    if (gl_LocalInvocationID.x < meshlet.index_count) {
        vertices[gl_LocalInvocationID.x] = fetch_vertex_and_project_to_ndc(gl_LocalInvocationID.x);
    }
    barrier();
    const uint primitive_id = gl_LocalInvocationID.x;
    if (primitive_id < meshlet.primitive_count) {
        const vec3[] triangles = rasterize_triangles(primitive_id);
        imageAtomicMax(visbuffer, ivec2(triangles[0].xy), triangles[0].z);
        imageAtomicMax(visbuffer, ivec2(triangles[1].xy), triangles[1].z);
        imageAtomicMax(visbuffer, ivec2(triangles[2].xy), triangles[2].z);
    }
}
```
This is the gist
The things that take the most are the two barriers (and I have no idea why)
remove the barriers 
wg size is (128, 1, 1) and invocation size is (meshlet_count, 1, 1) btw
Or one workgroup per meshlet, one thread per primitive
It is almost a 1 to 1 copy of what unreal does 
hmm I guess you can't really change the wg size
but yeah barrier with a big wg will be slow
ouphe
make wg bigger then 
2x worse perf
make it samer
Is this barrier really that destructive?
```glsl
if (gl_LocalInvocationID.x == 0 && is_sw_rast) {
    meshlet = meshlet_ptr.data[meshlet_id];
}
barrier();
```
That'd add a lot of VRAM traffic
ah
It is copying something like 64 bytes of data into shared memory
How is this barrier taking half the time spent in the shader
well it's possible it's just an artifact of how the profiler reports things
in RGP, actual load instructions will appear very cheap, but then their cost will show up at s_waitcnt instructions
just one teensy weensy little load
Like legit it's just this
```glsl
void main() {
    const uint meshlet_instance_id = gl_WorkGroupID.x;
    const uint meshlet_id = instance_ptr.data[meshlet_instance_id].meshlet_id;
    const uint instance_id = instance_ptr.data[meshlet_instance_id].instance_id;
    const uint primitive_id = gl_LocalInvocationID.x;
    const bool is_sw_rast = cluster_class_ptr.data[meshlet_instance_id] == CLUSTER_CLASS_SW_RASTER;
    if (gl_LocalInvocationID.x == 0 && is_sw_rast) {
        meshlet = meshlet_ptr.data[meshlet_id];
    }
    barrier();
}
```
Oh wait there is another load
the classification
Let me try forcing SW
It's the same
I am dying
Alright
I will rasterize NOTHING
?????????????????????
16 milliseconds for doing nothing

What the hell is happening
are you dispatching a lot of wgs or something
uh
maybe
is 500'000 considered a lot
btw I cracked it
it was the imageAtomicMax
I have to thank the profiler for misleading me and doing absolutely nothing to help me figure it out 
Now it's taking the same as the hardware rasterizer
Which is still terrible
it should be taking 3x less time than HW raster in this particular case (according to unreal)
well
Was I wrong to expect a crazy speed up maybe?
Ah beautiful
On todays episode of: things that make no sense
Thai scene: 30 million primitives so many small triangles, compute matches raster (compute should be faster)
Bistro scene: 5 million primitives and many big triangles, compute is faster than raster
heh
maybe your heuristic to check for median tri size is off
I am doing full software vs full raster right now
ah ok
I'm surprised that it's possible to match/beat raster hw at all
its perf breaks down when you have really bad quad occupancy
man I should've started following earlier
with thin or tiny tris
is one of the drawbacks that you need to store a buffer of all triangles ever instanced in your scene? I can't imagine you can beat the hardware vertex cache
idk how much of a perf uplift hw vertex reuse is, but clearly it's not unbeatable
you also lose a lot of potential vertex reuse when you render unconnected meshlets
true
I wonder what the total memory requirement to render bistro is (minus images)
btw, I wonder how much vertex reuse you can get with shared memory
oh wait
I think lvstri is already loading verts to shared mem
oh I think I get it, because you're working with meshlets you have a known bound on the triangles you're working on
so your shared mem gets loaded with your meshlet's vertices and you go from there
I guess you could do something like this
```
layout (local_size_x = 128) in; // max number of verts in a meshlet

// fetch and shade vertex[localInvocationID]
// store transformed vertex in shared memory
barrier();
if (localInvocationID < numPrimsInThisMeshlet) {
    // assemble primitive[localInvocationID]
    // rasterize primitive
}
```
yeah, rasterizing 1 primitive per thread sounds kinda funky though
is that what the hardware does
well, there is dedicated hw for rasterizing prims, so it's super fast
but in this case, your prims are only a few pixels large at most
though I guess since we're talking small triangles specifically that this is used for
yeah
software rasterization in compute always seemed pretty cumbersome to me because both the vertex and fragment operations essentially decompress into a ton more data to process in the next stage
but it makes sense how this technique deals with it
it works well in very constrained situations
in nanite, they're working with tiny triangles in a visbuffer-like renderer (so the fragment shader is literally just writing depth and a triangle/instance ID)
and best of all, infinitely bikesheddable
no
you can do the same tricks as meshshaders do
you shade all verts in a meshlet within a workgroup
then share the results and create triangles
then rasterize
my frogge
that can and will beat the hw vertex cache hard
you should probably never attempt to even do any frag shading
the only strong usecase with software raster is to write a visbuffer with depth
frog shading
I think I cracked the code
Perhaps the reason my compute raster was so garbage, was due to unconditionally imageAtomicMax'ing
I should've just done the classic
```glsl
for (uint y = min.y; y < max.y; ++y) {
    for (uint x = min.x; x < max.x; ++x) {
        if (is_inside_triangle(x, y, ...)) {
            imageAtomicMax(visbuffer, ...);
        }
    }
}
```
I’m guessing when you switched to conditional atomic the level of contention went way down?
I'm still testing right now, results will be in soon™️
Alright results are in
And what sad results these are, occupancy remained the same, after all, rasterizing pixel sized triangles is quite easy
As did the time taken for Thai (2.83)
I am going to assume something is fundamentally wrong with the way I build this software rasterizer
Until I find someone to pester about this, software raster is on hold 😦
rip, what algorithm do you use to actually rasterize btw
which algorithm? scanline? checking if a pixel is in the triangle in an AABB?
For each pixel in the triangle bounds yes
Yeah it is quite bad, but somehow still manages good occupancy
Yes one prim per thread
I'll try unreal's approach
what's unreal's approach?
They have a hybrid scanline/pixel in AABB method
They choose one based on triangle screen footprint
oh gross I need a UE account and need to connect it
Yes, very sad
I was thinking about the way I classify meshlet area
Could I compute the perfect area of a cluster in local space at load time and then scale that based on view distance and the transform's scale? 
Well it doesn't really matter right now, gotta solve sw raster first, I'll give it a few tries more and then move on
does that not depend on how (as in where) you look at the mesh, which you cant possibly know at load time?
Yes, at load time you compute a "baseline", the true area of the cluster in local space
Then, the idea is to scale that area based on view distance and transform's scale
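A back-of-the-envelope sketch of that idea, with loud assumptions: uniform scale, a perspective camera with vertical field of view `fov_y_radians`, and the cluster treated as if it faced the camera head-on. Since orientation is ignored, the result is more of an upper bound than a true projected area. The function name and parameters are made up for illustration.

```cpp
#include <cmath>

// Under perspective, on-screen length scales with focal_px / distance,
// so on-screen area scales with the square of that factor.
inline double projected_area_px(double local_area, double uniform_scale,
                                double view_distance, double fov_y_radians,
                                double viewport_height_px) {
    // Focal length expressed in pixels.
    const double focal_px = 0.5 * viewport_height_px / std::tan(0.5 * fov_y_radians);
    const double linear = uniform_scale * focal_px / view_distance;
    return local_area * linear * linear;
}
```

Doubling the view distance quarters the estimate, and doubling the scale quadruples it.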
Hmm perhaps this does not work with disconnected clusters (i.e. triangles that are not connected but share the same cluster ID)
aren't you generally trying to avoid having those though
Yes but meshoptimizer can't help but make disconnected cluster sometimes
I might make my own meshletizer
Or hack into meshoptimizer and fix that "bug"
sounds like some uv unwrap, which also contains depth information : > (but ignore me, im just blabbering)
I have reached a conclusion
Actually two conclusions
Conclusion #1: I was indeed doing fundamentally flawed calculations
Conclusion #2:
```glsl
void rasterize(in vec3[3] triangle, in uint64_t payload) {
    const vec4 bounds = make_bounding_box(triangle[0], triangle[1], triangle[2]);
    const uint start_x = uint(bounds.x);
    const uint start_y = uint(bounds.y);
    const uint end_x = uint(bounds.z);
    const uint end_y = uint(bounds.w);
    for (uint x = start_x; x < end_x; ++x) {
        for (uint y = start_y; y < end_y; ++y) {
            const vec3 barycentric = make_barycentric(triangle[0], triangle[1], triangle[2], vec2(x, y));
            if (barycentric.x < 0.0 || barycentric.y < 0.0 || barycentric.z < 0.0) {
                continue;
            }
            const float z = dot(barycentric, vec3(triangle[0].z, triangle[1].z, triangle[2].z));
            imageAtomicMax(u_visbuffer, ivec2(x, y), (uint64_t(floatBitsToUint(z)) << 34) | payload);
        }
    }
}
```
This is pure and utter garbage 
it doesn't respect any rasterization spec ever created
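For comparison, a CPU sketch of a bounding-box rasterizer that at least samples at pixel centers, which is one of the things the usual rasterization rules ask for. It is deliberately simplified: no top-left fill rule (pixels exactly on a shared edge get drawn by both triangles), no sub-pixel fixed point, and it returns covered pixel indices instead of writing a visbuffer. `Vec2`, `edge` and `rasterize` are illustrative names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Edge function: twice the signed area of triangle (a, b, p).
inline float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Returns the indices (y * width + x) of all pixels whose center lies inside the triangle.
inline std::vector<std::uint32_t> rasterize(const Vec2& v0, const Vec2& v1, const Vec2& v2,
                                            int width, int height) {
    std::vector<std::uint32_t> covered;
    const float area = edge(v0, v1, v2);
    if (area == 0.0f) return covered; // degenerate triangle

    // Bounding box clamped to the framebuffer.
    const int min_x = std::max(0, int(std::floor(std::min({v0.x, v1.x, v2.x}))));
    const int min_y = std::max(0, int(std::floor(std::min({v0.y, v1.y, v2.y}))));
    const int max_x = std::min(width - 1, int(std::floor(std::max({v0.x, v1.x, v2.x}))));
    const int max_y = std::min(height - 1, int(std::floor(std::max({v0.y, v1.y, v2.y}))));

    for (int y = min_y; y <= max_y; ++y) {
        for (int x = min_x; x <= max_x; ++x) {
            const Vec2 p = {x + 0.5f, y + 0.5f}; // sample at the pixel center
            // Dividing by the signed area makes the test winding independent.
            const float w0 = edge(v1, v2, p) / area;
            const float w1 = edge(v2, v0, p) / area;
            const float w2 = edge(v0, v1, p) / area;
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;
            covered.push_back(std::uint32_t(y * width + x));
        }
    }
    return covered;
}
```

The weights `w0`, `w1`, `w2` are the barycentrics, so depth can be interpolated from them the same way as in the shader above.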
Am I overthinking this? What the hell is NaniteViewAndInvViewSize and NaniteViewRect
what gpu?
RTX 3070
strange
Perhaps their clusterizer is that much better than Meshoptimizer?
hmm never seen select in hlsl before, only know it from socket nonsense
It's old behaviour for ?: apparently
Thing is, what ?: does isn't documented either for vector types 
Guessing purely from naming select is any(…)
Then if it’s different from ?: then that should be all
select sounds more like glsl's step
Regardless, something is, once again, fundamentally wrong
Mesh shaders are nice but they lack flexibility in choosing a meshlet's size
On NV it's either 64/126 or death
write about it, perhaps it tickles some $GPUVENDOR engineer's interest
I don't think they will change their schtuff because I can't make a compute rasterizer efficient 
heh
There's nothing wrong with 64/126 per se, it's the workgroup size mismatch that kills me
And NV likes 128 a lot more (for compute)
Task/Mesh is 32 only
But regardless, occupancy is fine, I'm always and forever limited by VRAM
too bad you're in uncharted territory with this stuff
you could ask peeps to run your stuff on different hardware, if that helps
to collect some data
AMD hardware likes completely different things from NVIDIA's 
package telemetry with your app to collect extra data
NVIDIA likes a WG of 32 for task/mesh, AMD likes one vertex/primitive per invocation
give it a little ifdef, as a treat
or compile 2 binaries
Could you send a little treat

I send you my regards
A 7900xtx will suffice
Revolution!
people who wrote those shaders for unreal are probably in the copyright notice or commit log
mayhaps reach out
I could
worst they could do is send Hitmen to terminate me due to copyright violation
or have you hired
Jaker
you are at AMD
explain what primitive shaders are and how they are different from mesh shaders
it's your patent after all
what have you googled
"The vast majority of triangles are software rasterised using hyper-optimised compute shaders specifically designed for the advantages we can exploit," explains Brian Karis. "As a result, we've been able to leave hardware rasterisers in the dust at this specific task. Software rasterisation is a core component of Nanite that allows it to achieve what it does. We can't beat hardware rasterisers in all cases though so we'll use hardware when we've determined it's the faster path. On PlayStation 5 we use primitive shaders for that path which is considerably faster than using the old pipeline we had before with vertex shaders."```
from: <https://www.eurogamer.net/digitalfoundry-2020-unreal-engine-5-playstation-5-tech-demo-analysis>
here is moar info
https://timur.hu/blog/2022/what-is-ngg
NGG (Next Generation Geometry) is the technology that is responsible for any vertex and geometry processing in AMD RDNA GPUs. I decided to do a write-up about my experience implementing it in RADV, which is the Vulkan driver used by many Linux systems, including the Steam Deck. I will also talk about shader culling on RDNA GPUs.
epic
not sure how close you can get to handwritten primitive shaders, but I guess using mesh shaders and following best practices will probably get you there
I wanted to cope with: "Maybe my mesh shader is so good it doesn't need a software rasterizer"
Or something cringe like that 
But it turns out the PS5 does something different
Damn you AMD
Yes but they use vertex shaders
That means their soft rast path is faster*
*in certain cases
de2our
no way
https://youtu.be/vYqlbzrtI9Y
Multum In Parvo: Level of Detail and Approximation Models at the Graphics Nexus
Tamy Boubekeur, Adobe Research
Keynote - HPG 2023 - Day 3
idk, I don't do it
who does it
@ pixelduck @ nanokatze @ mohamexiety @ martty @ pac85
Is driver dev on NV impossible?
noyes
They don't share anything and the only open source driver sucks (I heard at least)
Hmm a 6950xt costs 600 robux
best I can do is refer you to your nearest nvidia representative 
nearest? you mean in my walls
we're roommates (in your walls)
toomuchvoltage is at nvidia iirc 😉
working on drivers though?
not sure
or devtech stuff mayhap
Btw
I went back to our old friend GL
And the vertex path here matches compute raster and mesh shaders on Vulkan as well
I have never seen 3 ridiculously different techniques agree on performance so much
What the hell
I want to profile unreal
unreal is instrumented
It is time to compile unreal from source, wish me luck
with gpu frame markers
Yes actually
I'm also stating that as a fact
I should first see if they list sw/hw timings
because I have to profile unreal every day at work 
epic
do you know how to get nanite sw/hw timings
Sparing me from the tedious documentation crunching 
uh you put D3D12.EmitRgpFrameMarkers=1 in DefaultEngine.ini and then profile it with RGP :^)
"go buy AMD hardware scrub"
How do you profile then
wdym
Connect from RGP to Unreal somehow?
you just hook up your favorite gpu profiler to the game you're profiling
Ah, do you have to export the game
for rgp, you just need to have rdp (the program that hooks into vulkan, dx12, and opencl apps) running before you launch the app
you can connect to the engine or a standalone build of le app in question
Oh that's great then
I don't want to wait 4 months for Unreal to package my stupid app with one mesh in it 
Last time I tried packaging my test thingy it took 1 hour
It had literally no actors beside a static mesh
Perhaps I did sumthing wrong
probably
I have done a lot of investigations
Rendering all kinds of scenes, bistro subdivided into oblivion as well 
I subdivided everything into at least 100 million triangles
And I am starting to see some gains. It appears that THE WHOLE viewport has to be covered in pixel-sized triangles for the HW raster to be much slower than the SW raster
Perhaps Nanite really is only needed for stupid amounts of triangles
i.e: 1 billion+
nanite only renders around 100 mil at 4k i believe
sw raster can also be great to allow for larger view distances
or later lod switching
Yeah, also overdraw I guess is much better with SW for larger draw distances
sadness
Why does Windows freak out when I go overboard the physical memory limit, it's using far more than 48GB 
blender is such a slow boy sometimes
I just re-read unreal's slides for the, uh
7th time 
And they say "we software rasterize all clusters where at least one triangle is more than 32 pixels wide"
huh
At least?
More?
I can't fucking write 
No more?
If all triangles are less than 32 pixels wide they software rasterize
That makes sense
Now, does it make sense to have a compute shader dispatched with meshlet_count workgroups of size (MAX_PRIMITIVES, 1, 1) that can check for that
It should also cull now that I think about it
HW raster gets the clusters that won't be SW rasterized yes
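As a sketch of that split, assuming "wide" means the larger side of a triangle's screen-space bounding box (the slides as quoted here don't pin that down, and whether the boundary is `<` or `<=` is a guess). `triangle_extent_px`, `classify_cluster` and `split_clusters` are made-up names, and the vertices are assumed to be already projected to pixel coordinates:

```python
def triangle_extent_px(tri):
    """Larger side of the triangle's screen-space bounding box, in pixels."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return max(max(xs) - min(xs), max(ys) - min(ys))

def classify_cluster(triangles, max_extent_px=32.0):
    """'sw' only if every triangle in the cluster is small, otherwise 'hw'."""
    if all(triangle_extent_px(t) < max_extent_px for t in triangles):
        return "sw"
    return "hw"

def split_clusters(clusters, max_extent_px=32.0):
    """Return (software_indices, hardware_indices) for a list of clusters."""
    sw, hw = [], []
    for index, triangles in enumerate(clusters):
        if classify_cluster(triangles, max_extent_px) == "sw":
            sw.append(index)
        else:
            hw.append(index)
    return sw, hw
```

A single large triangle is enough to push the whole cluster to the hardware path. On the GPU the `all(...)` would be a per-workgroup reduction, one invocation per primitive.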
Workgroup sizing has always been sort of a mystery to me
Only good way I know is to use microbenchmarks
I've googled without results, where do I find that tool
I mean you write your own microbenchmarks that tell performance of different group sizes.
Brute force search essentially
Results can be and often are gpu specific, so the benchmarks need to be run on each installation.
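The brute-force search could look roughly like this. `measure` is a hypothetical callback that runs the shader with a given workgroup size and returns the elapsed GPU time (from timestamp queries, for example); nothing here is a real profiler API:

```python
import statistics

def pick_workgroup_size(candidates, measure, repeats=5, warmup=1):
    """Time every candidate size and return (best_size, median_time_per_size).

    measure(size) is assumed to dispatch the workload with that workgroup
    size and return the elapsed time for one run.
    """
    results = {}
    for size in candidates:
        # Warm-up runs are discarded so caches and clocks can settle.
        for _ in range(warmup):
            measure(size)
        samples = [measure(size) for _ in range(repeats)]
        # The median is less sensitive to one-off spikes than the mean.
        results[size] = statistics.median(samples)
    best = min(results, key=results.get)
    return best, results
```

Since the winner is GPU specific, the search would run once per machine (at install or first launch, say) and the result would be cached.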
```glsl
shared meshlet_glsl_t s_meshlets[MESHLETS_PER_WORKGROUP];
shared mat4 s_pvm[MESHLETS_PER_WORKGROUP];
shared vec3 s_vertices[MAX_VERTICES * MESHLETS_PER_WORKGROUP];
shared uint s_primitive_size[MESHLETS_PER_WORKGROUP];
```
Do you guys think this is too much shared data?
How big are the constants
MESHLETS_PER_WORKGROUP is 4, MAX_VERTICES = MAX_PRIMITIVES = 64
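A back-of-the-envelope tally of those declarations with the constants above. The size of `meshlet_glsl_t` isn't given anywhere in the chat, so it is a parameter here, and whether the compiler pads a shared `vec3` from 12 to 16 bytes is also an assumption you can toggle:

```python
MESHLETS_PER_WORKGROUP = 4
MAX_VERTICES = 64

def shared_bytes(meshlet_struct_bytes, vec3_bytes=12):
    """Per-array shared memory usage in bytes.

    meshlet_struct_bytes: size of meshlet_glsl_t (unknown, caller's guess).
    vec3_bytes: 12 if tightly packed, 16 if the compiler pads vec3.
    """
    return {
        "s_meshlets": meshlet_struct_bytes * MESHLETS_PER_WORKGROUP,
        "s_pvm": 64 * MESHLETS_PER_WORKGROUP,  # mat4 = 16 floats = 64 bytes
        "s_vertices": vec3_bytes * MAX_VERTICES * MESHLETS_PER_WORKGROUP,
        "s_primitive_size": 4 * MESHLETS_PER_WORKGROUP,  # uint = 4 bytes
    }

def total_shared_bytes(meshlet_struct_bytes, vec3_bytes=12):
    return sum(shared_bytes(meshlet_struct_bytes, vec3_bytes).values())
```

With a 32-byte struct (a guess) this lands at roughly 3.4 KiB, or about 4.4 KiB if `vec3` is padded. That is well under the 16384 bytes Vulkan guarantees for `maxComputeSharedMemorySize`, though how much shared memory each workgroup takes can still limit how many workgroups fit on a core at once.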
Eh I expected nothing and of course this classifier is bad 
With this new classifier I have perfect accuracy
Except it takes about the same time it takes for me to rasterize the entire model 
How the hell does Unreal do this
damnit
this is probably an area that needs a ton of testing on big maps used in production
occupancy is great though 
note: the "cull" part is a LIE, there is no culling right now
MASSIVE
so what is the actual problem right now?
you can only render 10mio tris compared to UE's 100mio?
compared to what
To right now
but in what does it manifest itself
1.5ms to classify clusters is terrible
I have no idea, but considering the whole nanite pass takes less than 2 milliseconds..
its broken down into clusters already neh?
Yes, the classify shader is as efficient as I could make it
perhaps driver/hardware combos fuck with the results
This is the shader if anyone wants to take a look
I will test on other hardware just to see yes
have you profiled UE with nsight yet
or power states
I'll do it right now
Opening UE5 at the speed of light
This shader is the one taking 1.5?
yes
I wonder if the atomic ops are eating most of its time
It's VRAM bound to hell and back
I'd say vertex fetch and transform are eating up my precious milliseconds
Unfortunately I have no idea what LGSB is 
is there something you can do to simulate fetching fewer vertices/less vertex data to see if that helps perf
Yes, fewer verts does help
what about atomics
64 verts / 64 prims meshlets are the best
64 / 126 recommended by nvidia is kinda trash
try replacing atomics with regular ops and see if perf changes (ignoring the brokenness)
LGSB = LarGe String Buffer
removing the atomics quite literally changed nothing 
Damn alright
I guess AMD needs to invent infinity cache except for memory bandwidth
Infinite Memory Bandwidth (AMD patent pending)
I can
I have purposely left quantization for later™️
But I guess it's my only shot at better performance
out of curiosity i tried to open various nsight docs and tried putting LGSB into their searchboxes, 0 hits
which is quite weird
can you press F1 in that LGSB column/row within nsight
or is there a ? button in the system menu like good old win9x windows had
I hovered over LGSB and it said "Long Scoreboard" yes
when the classification stage takes more time than the rasterizer
I love absolutely nonsensical results
did you talk to devsh yet?
He's not active recently so I couldn't catch him
dont be afraid to boop him, im sure he has a few bits and bops to say about this
By the way something is 100% amiss with meshlets
There is no way in hell these 3 meshlets (remember, they have 64 triangles inside them) have a triangle MORE than 32 pixels wide
Red is hardware, Blue is software btw
Unless they are part of the same meshlet, which again makes no sense because meshlets should be continuous
inb4 bad perf is due to a bug causing you to render 64x more stuff
imma open an issue with meshoptimizer
Nevermind
zeux doesn't agree with me
mfw
I’m not sure I agree with this being a mistake. For mesh shading implementation on desktop my assessment has been that under filling meshlets results in more efficiency loss than the extra culling is worth.
At least I have peace of mind anything I've done so far wasn't wrong
Aight I guess I'll make my own custom meshopt based on the actual meshopt
It's funny, anything made by zeux I end up making my own fork
gltfpack => I have my own
meshopt => soon™️
I guess our opinions are very different 
lvstri trying to render 1 (one) mesh
Honestly lol
Anything I do does not conform to any generally agreed upon standard I have different requirements for everything 
Feels like I'm reinventing the universe
I'll follow deccus suggestion and take a break
Jaker you now have an employee
I expect paychecks
you could also implement a different part of the renderer if you don't want to break too hard
like shading
but yeah don't burn yourself doing too much schtuff
perhaps I'll port what I had back in opengl yes
Shadow maps took 4ms to render with culling back then
I wonder how much that improves with this ridiculously optimized pipeline
and speak to devsh
since wpotrick is rather useless 😄
and or write about your findings and problems
publish it somehow
potrick has helped me a lot with instancing, he's good
(i know, i was just kidding)
Also I think he just wants me to join his cult 
devsh has done full on meshlet stuff? I know he mentioned doing a visbuffer
yeah the so called brainworm
epic, I'll pong him tomorrow I guess
doesnt hurt

