#Iris - A Journey through OpenGL and beyond to learn Graphics


wicked notch
#

wtf

frank sail
#

it means you can't write Foo& const foo

#

because references are immutable

wicked notch
#

Hmm

#

That was not my intention, I wanted to do the classic™️ const Foo& foo
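
For the C++ reading of this exchange, a minimal sketch of the difference: `const Foo&` makes the object read-only through that reference, while a `const` on the reference itself is ill-formed because a reference can never be rebound anyway. `Foo`, `read_value` and `bump` are made-up names for illustration.

```cpp
#include <cassert>

struct Foo {
    int value = 0;
};

// const Foo&: the callee can read through the reference but not modify the Foo.
int read_value(const Foo& foo) {
    // foo.value = 1;  // would not compile: foo refers to a const Foo
    return foo.value;
}

// Foo& const foo;  // ill-formed: the reference itself cannot be const-qualified

// A non-const reference allows mutation of the referred-to object.
int bump(Foo& foo) {
    foo.value += 1;
    return foo.value;
}
```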

frank sail
#

oh this is in a shader frogstare

wicked notch
#

Yes, I wanted to do this basically:

```glsl
layout (buffer_reference) buffer BDA {};
void main() {
    const BDA ptr = BDA(address);
}
```
frank sail
#

ye saw your #vulkan post 😄

#

I just assumed C++

#

I think what you can do is make another struct (or buffer declaration) where all the members are const, then cast the address to that

#

honestly quite incredible

#

can you put the readonly qualifier on the buffer declaration

wicked notch
#

Actually yeah

#

That might do the tricc

#

But there is another problem

#

Wait, there is no problem

#

I am shrimply dum

#

I forgor GL_EXT_shader_explicit_arithmetic_types

#
[validation] Validation Error: [ UNASSIGNED-Device address out of bounds ] Object 0: handle = 0x26427633360, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x1a898625 | Device address 0x111d0030 access out of bounds. Command buffer (0x26432dba6a0). Draw Index 0. Pipeline (0x89e60f0000000042). Shader Module (0x5c5283000000003e). Shader Instruction Index = 226. Stage = Vertex. Vertex Index = 2 Instance Index = 0. Shader validation error occurred in file ../shaders/0.1/main.vert at line 43. Unable to find suitable #line directive in SPIR-V OpSource.
#

Validation truly is, broken (as in too powerful)

#

How do you even detect this KEKW

wispy spear
#

by running it

#

and catching this error

#

wrapping it in a function you can query

wicked notch
#

my first BDA triangle

#

BDA is so good

wispy spear
#

noice

wicked notch
#

I have no idea how I've lived without it for so long

wicked notch
#

VK_EXT_mesh_shader has been conquered

#

And it is faster than anything I have ever written in OpenGL somefuckinghow

#

HOW is this faster than my microoptimized, single drawcall, indexed meshlet emulator

#

This doesn't make any friggin sense KEKW

#

This doesn't even have vertex quantization, or anything at all really

#

It's just bruteforce meshlets and it's somehow faster..

#

All the time I spent optimizing the emulation in GL 🥲

wicked notch
#

After a very slight optimization, still no quantization, this is what it looks like

#

500 microseconds to render bistro (no culling at all) KEKW

finite yacht
#

for comparison against vertex shader I am getting 740 microseconds on a RX 5700 XT, also without any culling

wicked notch
#

Very nice

finite yacht
#

rtx 3070 is a more performant card so some difference definitely comes just from that

#

you render exterior only, yes?

wicked notch
#

No actually, interior as well

finite yacht
#

what the. The new amd drivers dont have GL_ARB_texture_compression_bptc anymore.
(Need to have some compressed internal format, otherwise interior fills vram completely)

#

another bug ticket it is

#

yes I consider that a bug

wicked notch
#

rip

#

Wait

#

My Vulkan thing has no textures

#

holy shit KEKW

#

I completely forgot about implementing textures

wicked notch
#

430 usecs for exterior only

#

Not much difference

finite yacht
#

i am dumb the extension string I was checking for had a typo

wicked notch
#

I am stupid: with traditional rendering, is the value written to the depth buffer just gl_Position.z after the w division?

frank sail
#

and after the viewport transform, I believe

#

but I've never used manual depth range so idk

wicked notch
#

This should work in Vulkan too right?
frank sail
#

Yes

wicked notch
#

So I can just use gl_FragCoord.z
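
A small sketch of the chain discussed here, assuming Vulkan's viewport depth transform (`minDepth + z_ndc * (maxDepth - minDepth)`). The struct and function names are illustrative and not taken from the project.

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };

// Depth range of the viewport (VkViewport::minDepth / maxDepth in Vulkan).
struct DepthRange { double min_depth, max_depth; };

// Perspective divide: clip-space z -> NDC z.
double ndc_depth(const Vec4& clip) {
    assert(clip.w != 0.0);
    return clip.z / clip.w;
}

// Viewport transform: NDC z -> the value that lands in the depth buffer
// (and that the fragment shader sees as gl_FragCoord.z).
double window_depth(const Vec4& clip, const DepthRange& range) {
    return range.min_depth + ndc_depth(clip) * (range.max_depth - range.min_depth);
}
```

With the usual `[0, 1]` depth range the viewport step is the identity, so the stored value is just `z / w`. OpenGL's default differs: it maps NDC z from `[-1, 1]` with `0.5 * z + 0.5` unless `glClipControl` changes the clip-space convention.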

#

epic

wicked notch
#

I have huge respect for Unreal Engine devs

#

I already had huge respect but now that I'm trying to do what they do

#

It's incredibly painful bleakekw

#

nsight does not support R64_UINT textures

#

The dudes over at epic really just did Nanite without debuggers bleakekw

wispy spear
#

or has tools which can handle it

#

or early versions of nsight etc, im sure there is some cooperation happening

wicked notch
#

Likely yeah, they probably have their own debuggers and tools tbh

wispy spear
#

that should encourage you to make one yourself too : >

wicked notch
#

I am only one human being

wispy spear
#

but a smart one

wicked notch
#

It's a numbers problem KEKW

wispy spear
#

doesnt make you less schmart 🙂

wicked notch
#

Ladies and gents

#

We got the thing

#

we have depth, meshlet ID and primitive ID inside a 64bit framebuffer

#

hybrid software/hardware rasterization coming soon™️

#
```glsl
#version 460
#extension GL_ARB_separate_shader_objects : enable
#extension GL_EXT_shader_explicit_arithmetic_types : enable
#extension GL_EXT_shader_image_int64 : enable
#extension GL_EXT_mesh_shader : enable

layout (location = 0) in i_vertex_data_block {
    flat uint i_meshlet_id;
};

layout (r64ui, set = 0, binding = 1) uniform u64image2D u_visbuffer;

void main() {
    const uint64_t depth = uint64_t(floatBitsToUint(gl_FragCoord.z) & 0x3fffffffu);
    // Layout: [63:34] depth (30 bits) | [33:7] meshlet ID (27 bits) | [6:0] primitive ID (7 bits).
    // The meshlet ID is masked before the shift; masking after it would cut the ID down to 20 bits.
    const uint64_t payload =
        (uint64_t(depth) << 34) |
        (uint64_t(i_meshlet_id & 0x07ffffffu) << 7) |
        ((uint64_t(gl_PrimitiveID)) & 0x7f);
    imageAtomicMax(u_visbuffer, ivec2(gl_FragCoord.xy), payload);
}
```
world's weirdest fragment shader KEKW
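
A CPU-side sketch of the same packing, assuming the 30/27/7-bit split that the shader's masks and shifts suggest. All names here (`pack_visbuffer`, the `k...` constants) are invented for illustration. It shows why a plain integer max works as a depth test: the depth sits in the top bits, and the bit pattern of a non-negative float orders the same way as the float itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed bit layout: [63:34] depth, [33:7] meshlet ID, [6:0] primitive ID.
constexpr unsigned kPrimitiveBits = 7;
constexpr unsigned kMeshletBits = 27;
constexpr unsigned kDepthBits = 30;
constexpr std::uint64_t kPrimitiveMask = (1ull << kPrimitiveBits) - 1;
constexpr std::uint64_t kMeshletMask = (1ull << kMeshletBits) - 1;
constexpr std::uint64_t kDepthMask = (1ull << kDepthBits) - 1;

// CPU stand-in for GLSL floatBitsToUint.
std::uint32_t float_bits(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

std::uint64_t pack_visbuffer(float depth, std::uint32_t meshlet_id, std::uint32_t primitive_id) {
    // For depth in [0, 1] the top two bits of the float are zero, so 30 bits keep it intact.
    assert(depth >= 0.0f && depth <= 1.0f);
    const std::uint64_t depth_bits = float_bits(depth) & kDepthMask;
    return (depth_bits << (kMeshletBits + kPrimitiveBits)) |
           ((meshlet_id & kMeshletMask) << kPrimitiveBits) |
           (primitive_id & kPrimitiveMask);
}

std::uint32_t unpack_meshlet_id(std::uint64_t payload) {
    return static_cast<std::uint32_t>((payload >> kPrimitiveBits) & kMeshletMask);
}

std::uint32_t unpack_primitive_id(std::uint64_t payload) {
    return static_cast<std::uint32_t>(payload & kPrimitiveMask);
}
```

Whichever fragment has the larger depth value produces the larger 64-bit word, regardless of the IDs below it, so the max keeps that fragment's IDs along with its depth.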
wispy spear
#

name the constants too please

#

(not location/binding/set 🙂)

wicked notch
#

Uhh

#

I think position is kinda broken bleakekw

#
```glsl
vec3 unproject_depth(in float depth, in vec2 uv) {
    const vec4 ndc = vec4(uv * 2.0 - 1.0, depth, 1.0);
    const vec4 world = inverse(u_camera.data.pv) * ndc;
    return world.xyz / world.w;
}
const vec3 position = unproject_depth(depth, gl_FragCoord.xy / vec2(resolution));
```
#

Am I stupid or is this fine?

frank sail
#

your world variable should be called clip

#

or something bleakekw

wicked notch
#

I don't think that's what's causing the problem bleakekw

#

Also you sure?

frank sail
#

anyways, looks okay

wicked notch
#

If you invert PV and transform NDC you get M

frank sail
#

yeah so you actually just have world but with funny w
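
To make the "funny w" concrete, here is a sketch that swaps the general `inverse(pv)` for a hand-inverted projection. It assumes a simplified infinite reverse-Z perspective matrix, covers view space only, and has no y flip. `Projection`, `project` and `unproject` are hypothetical names, not the project's code.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Vec4 { double x, y, z, w; };

// Hypothetical infinite reverse-Z projection (view space looks down -z):
// clip = (sx * x, sy * y, z_near, -z), so NDC depth is z_near / -z.
struct Projection { double sx, sy, z_near; };

Vec4 project(const Projection& p, const Vec3& view) {
    return Vec4{p.sx * view.x, p.sy * view.y, p.z_near, -view.z};
}

// Applies the analytic inverse of the matrix above to an NDC point.
// The result is the position scaled by some w, hence the final divide.
Vec3 unproject(const Projection& p, double ndc_x, double ndc_y, double ndc_depth) {
    assert(ndc_depth > 0.0);  // depth 0 is the point at infinity in reverse-Z
    const Vec4 h{ndc_x / p.sx, ndc_y / p.sy, -1.0, ndc_depth / p.z_near};
    return Vec3{h.x / h.w, h.y / h.w, h.z / h.w};
}
```

The intermediate `h` is already the right point in homogeneous form; its `w` is `ndc_depth / z_near` rather than 1, which is why the divide at the end is needed.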

wicked notch
#

world_with_funny_w

wispy spear
#

looks cool though

wicked notch
#

I fixed

#

somehow gl_FragCoord.xy / resolution is different than getting uvs from vertex shader

wispy spear
#

barythingy vs perspective perhaps?

wicked notch
#

Probable

wicked notch
#

very correct interpolation

wispy spear
#

leave it like that

#

now make the game 🙂

wicked notch
#

this is required to make the game

#

Very important to have nanite KEKW

wispy spear
#

good progress either way 🙂

wicked notch
#

We have le normals

#

I can still do analytical partial derivatives let's goo

#

I don't understand the need to rescale the derivatives though thonk

#

perhaps doing the math by hand could be beneficial

wicked notch
#

Is EmitMeshTasksEXT just a fancy vkCmdDispatch except it's in a task shader? frog_thinkk

frank sail
#

Yes

wicked notch
#

Hmm

frank sail
#

Too late, I sleep now. Gn frogeheart

wicked notch
#

Gn sir

#

By the time you awake from your slumber, meshlet culling shall be fully functional

wicked notch
#

Hmm, I don't understand glsl subgroupBallotExclusiveBitCount

#

gl_SubgroupSize could be 64 in case of AMD, so how does this return uint? frog_thinkk

#

Does subgroupBallotExclusiveBitCount perhaps count bits up until gl_SubgroupInvocationID (excluded)?

wicked notch
#

subgroupBallotExclusiveBitCount returns uint, that's 32 bits, not enough to hold all subgroup ballots (for wave64's)

proven laurel
#

uint subgroupBallotExclusiveBitCount(uvec4 value) returns the exclusive scan of the number of bits set in value, only counting the bottom gl_SubgroupSize bits (we'll cover what an exclusive scan is later).

#

number of bits

wicked notch
#

You see, I can't read

wicked notch
#

Makes sense then

#

only counting the bottom gl_SubgroupSize bits is an extremely convoluted and cryptic way of saying "we only count bits up until the current subgroup's ballot excluded"

#

Unless that's not actually what it's saying

#

It would make sense though

#

Since you can do stuff[base + subgroupBallotExclusiveBitCount(vote)] = other_stuff
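
A CPU emulation of the ballot functions may help here; this is a sketch with made-up helper names, not the GLSL built-ins themselves. The ballot is a bitmask with one bit per invocation, and the exclusive bit count is the number of set bits strictly below the calling invocation. The result is a count (at most 63 for a 64-wide subgroup), so a `uint` holds it comfortably.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU stand-in for a subgroup ballot: bit i is set when invocation i voted true.
// A 64-bit mask covers wave32 and wave64; GLSL uses a uvec4 (128 bits).
std::uint64_t ballot(const std::vector<bool>& votes) {
    assert(votes.size() <= 64);
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < votes.size(); ++i) {
        if (votes[i]) mask |= (1ull << i);
    }
    return mask;
}

// Number of set bits in the whole ballot.
std::uint32_t bit_count(std::uint64_t mask) {
    std::uint32_t count = 0;
    for (; mask != 0; mask &= mask - 1) ++count;
    return count;
}

// Number of set bits strictly below `invocation`: the exclusive scan.
std::uint32_t exclusive_bit_count(std::uint64_t mask, unsigned invocation) {
    assert(invocation < 64);
    const std::uint64_t below = (1ull << invocation) - 1;
    return bit_count(mask & below);
}

// Stream compaction: each voting invocation writes its own index to a unique slot.
std::vector<std::uint32_t> compact(const std::vector<bool>& votes) {
    const std::uint64_t mask = ballot(votes);
    std::vector<std::uint32_t> out(bit_count(mask));
    for (unsigned i = 0; i < votes.size(); ++i) {
        if (votes[i]) out[exclusive_bit_count(mask, i)] = i;
    }
    return out;
}
```

Only invocations that voted true get a unique slot; a non-voting invocation shares its exclusive count with the next voter, so it must not write.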

#
```glsl
layout (local_size_x = WORKGROUP_SIZE, local_size_y = 1, local_size_z = 1) in;

layout (push_constant) uniform pc_data_block {
    uint64_t meshlet_address;
    uint64_t vertex_address;
    uint64_t index_address;
    uint64_t primitive_address;
    uint64_t transforms_address;
    uint meshlet_count;
};

taskPayloadSharedEXT task_payload_t payload;

void main() {
    const uint meshlet_id = gl_GlobalInvocationID.x;
    const bool is_visible = meshlet_id < meshlet_count /*&& frustum_cull(meshlet_id)*/;
    const uvec4 vote = subgroupBallot(is_visible);
    const uint surviving = subgroupBallotBitCount(vote);
    const uint offset_index = subgroupBallotExclusiveBitCount(vote);
    payload.base_meshlet_id = gl_WorkGroupID.x * WORKGROUP_SIZE;
    // Only surviving invocations own a compacted slot; a culled invocation would
    // otherwise overwrite the slot of the next surviving one.
    if (is_visible) {
        payload.meshlet_offset[offset_index] = uint8_t(gl_LocalInvocationID.x);
    }
    // The ballot is per subgroup, so this assumes WORKGROUP_SIZE == gl_SubgroupSize.
    // EmitMeshTasksEXT takes its arguments from the first invocation and is called
    // by all invocations in uniform control flow.
    EmitMeshTasksEXT(surviving, 1, 1);
}
```
World's dumbest task shader
#

holy shit it works

wicked notch
#

Mfw it's easier to write task shaders than to do frustum culling with infinite/reverse Z projections

#

I am dumb and stupid

#

I always forget to do prim_count * 3 instead of just prim_count

#

ffs

wicked notch
#

Turns out the projection is fine

#

My plane extraction method also works fine (I think)

wicked notch
#

It took a long while

#

But we did it

#

world's most efficient rasterizer

#

But we do not stop here

#

We can be efficienter

raven orchid
#

is this the page where you'll be documenting nanite impl progress?

wicked notch
#

Yes

#

It's mostly memes + me ranting about stuff I don't like though KEKW

#

Notice that I use the easy way out, i.e I use mesh shaders

#

Nanite emulates them, I do have an OpenGL prototype for mesh shader emulation, but it's a pain to work with

raven orchid
#

Awesome gonna follow this

#

Do they do that so they can support apis that don’t have mesh shaders?

wicked notch
#

Yeah

#

You could (in theory) run nanite on 10 year old GPUs if you really wanted to KEKW

#

The minimum requirement is just 64 bit buffer atomics

wicked notch
#

Alright next on the list are

  • HiZ occlusion culling
  • Primitive culling
  • Cluster screenspace area classification
#

And hopefully I can begin with compute rasterization soon™️

wicked notch
#

Hmm, gltfpack doesn't seem to be able to generate instances on its own

#

I mean it makes sense that one model = one instance, but eh

wicked notch
#

cgltf chokes on EXT_mesh_gpu_instancing nervous

wicked notch
#

I have been pondering

#
```cpp
struct meshlet_glsl_t {
    uint32 vertex_offset = 0;
    uint32 index_offset = 0;
    uint32 primitive_offset = 0;
    uint32 index_count = 0;
    uint32 primitive_count = 0;
    uint32 group_id = 0;
    alignas(alignof(float32)) aabb_t aabb = {};
};
```
#

Given that my meshlet struct is about 48 bytes

wicked notch
#

Soon to become just 40, replacing the AABB with the sphere
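
The 48 and 40 byte figures can be checked at compile time. This sketch assumes `aabb_t` is two `vec3` corners (24 bytes) and the sphere is a centre plus radius (16 bytes); those layouts and the type names are guesses, not the project's actual definitions.

```cpp
#include <cassert>
#include <cstdint>

struct vec3_t { float x, y, z; };
// Assumed layouts: an AABB as two corners, a bounding sphere as centre + radius.
struct aabb_t { vec3_t min, max; };                 // 24 bytes
struct sphere_t { vec3_t center; float radius; };   // 16 bytes

struct meshlet_with_aabb_t {
    std::uint32_t vertex_offset;
    std::uint32_t index_offset;
    std::uint32_t primitive_offset;
    std::uint32_t index_count;
    std::uint32_t primitive_count;
    std::uint32_t group_id;
    aabb_t aabb;
};

struct meshlet_with_sphere_t {
    std::uint32_t vertex_offset;
    std::uint32_t index_offset;
    std::uint32_t primitive_offset;
    std::uint32_t index_count;
    std::uint32_t primitive_count;
    std::uint32_t group_id;
    sphere_t sphere;
};

// Six 32-bit fields (24 bytes) plus the bounding volume.
static_assert(sizeof(meshlet_with_aabb_t) == 48, "24 bytes of fields + 24-byte AABB");
static_assert(sizeof(meshlet_with_sphere_t) == 40, "24 bytes of fields + 16-byte sphere");

// Total bytes for a buffer holding `count` meshlets of the given size.
std::uint64_t meshlet_buffer_bytes(std::uint64_t count, std::uint64_t meshlet_size) {
    return count * meshlet_size;
}
```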

#

Say a mesh subdivides in N meshlets and this mesh has M instances

#

Does it make sense to have N * M meshlets?

#

There will be a ton of redundancy...

wispy spear
#

do you predict that this change will have a big positive impact on $PERFORMANCE?

wicked notch
#

I don't care about that yet

#

I am just brainstorming how to do instanced rendering with meshlets

#

But it will have a huge impact on memory

#

Right now I'm uploading M times the same vertices to the GPU, over and over again nervous
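
One way to cut the redundancy, sketched under assumptions: store the N meshlets once and keep a small `(meshlet_id, instance_id)` pair per drawn meshlet. The pair mirrors the `meshlet_instances[...]` indexing that appears later in this log; the byte-count helpers are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// One entry per (meshlet, instance) pair; the geometry itself is stored once.
struct meshlet_instance_t {
    std::uint32_t meshlet_id;
    std::uint32_t instance_id;
};

// N meshlets copied for each of M instances.
std::uint64_t duplicated_bytes(std::uint64_t n, std::uint64_t m, std::uint64_t meshlet_size) {
    return n * m * meshlet_size;
}

// N meshlets stored once, plus N * M small index pairs.
std::uint64_t indirected_bytes(std::uint64_t n, std::uint64_t m, std::uint64_t meshlet_size) {
    return n * meshlet_size + n * m * sizeof(meshlet_instance_t);
}
```

The meshlet headers are the smaller part of the saving; vertex and index data would also be uploaded once instead of M times.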

wispy spear
#

hmm

wicked notch
#

#graphics-techniques message

wispy spear
#

pinned it too, jaker talks so much it would just go under again

wicked notch
#

Thanks m8

wicked notch
#

Instances!

wicked notch
#

Hmm my instancing is borking meshoptimizer somehow

#

It can't generate a proper vertex remap for bistro only

#

Perhaps I am breaking some assumption meshopt makes

wicked notch
#

Just now I realize how small a number 134217728 is nervous

#

With big scenes the number of meshlets is insane

#

Powerplant alone is 12 million (instanced) meshlets

wispy spear
#

oof

wicked notch
#

I have now reduced the number of meshlets considerably

#

at the cost of more memory

#

for fucks sake

#

Why can't I have 1 terabyte of VRAM

#

Everything would be so much easier

wicked notch
#

I was debugging why displaying normals would cause a device lost

#

Even though displaying meshlet ID would work just fine

#

Turns out I was passing 0 as the meshlet instance buffer address

#

How the hell was it working before

#

???

#

Alright now powerplant is 200628 meshlets froge

raven orchid
#

Does performance improve to a point with more meshlets or degrade? Is it a balancing act to keep the number within a good range?

wicked notch
#

Of course performance scales pretty much linearly with the number of meshlets, but occupancy stays the same

#

Whether you send 100 meshlets or 100000 meshlets the GPU will happily process them at full speed

#

Perhaps it would be a cool experiment to merge ALL meshes into a single huge mesh and derive meshlets from that

wispy spear
#

powerplant's primitives are also quite awkward

#

all the pipes are one, iirc for example

wicked notch
#

Hmm

#

I currently do depth testing with a shrimple imageAtomicMax
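
A CPU stand-in for that atomic max, as a sketch: `std::atomic` has no built-in 64-bit max before C++26, so this uses a compare-exchange loop. The function name is made up. With the depth packed into the top bits of the word, keeping the max keeps the fragment with the largest depth value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// CPU stand-in for imageAtomicMax on a single 64-bit texel.
// Returns the previous value, like the GLSL built-in.
std::uint64_t atomic_max(std::atomic<std::uint64_t>& texel, std::uint64_t value) {
    std::uint64_t previous = texel.load(std::memory_order_relaxed);
    while (previous < value &&
           !texel.compare_exchange_weak(previous, value, std::memory_order_relaxed)) {
        // compare_exchange_weak refreshes `previous` on failure; retry while we still win.
    }
    return previous;
}
```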

#

Actually, nevermind whatever I was thinking KEKW

wicked notch
#

I really don't like HiZ

#

It's overly conservative at times, you have to handle disocclusion events, frame 0 is a special case

wispy spear
#

you could play Rust or PubG instead

wispy spear
#

didnt you ponder around the other day, that hiz doesnt do anything for your already good meshlet renderisms?

#

or is that a different thing

wicked notch
#

Perhaps I misspoke, occlusion culling is definitely useful

#

I just have a personal feud with HiZ KEKW

#

The impl is also straight from Niagara so..

#

Anyways, I don't think it's practical to do ROC with meshlets, far too many AABBs lol

#

The culling would not be worth it I think, perhaps I could experiment

#

It wouldn't integrate very well with the TASK/MESH pipeline but eh

#

It's just one more buffer, what's wrong with that

#

What do you guys predict, will ROC be worth it?

#

Bets are open

frank sail
#

republic of china

wispy spear
#

ah

wicked notch
#

Tbh I kinda want to just leave HiZ as is and come back to it later

#

I would like to move onto the next step, which is cluster area/error estimation

wispy spear
#

isnt roc "just"(tm) yaymd 's new cuda?

wicked notch
#

And software rasterization for the big gains

wispy spear
#

ah you are tralking about something else

#

republic of coomers

wicked notch
#

Yes, ROC is raster occlusion culling

wispy spear
#

oops : > you got me

wicked notch
#

Anyways I have a small problem

#

Wpotrick did not do the cluster area estimation yet

wispy spear
#

there are exercises for that

wicked notch
#

Which means I am on my own nervous

wispy spear
#

perhaps pester him to figure something out with you

#

you are the only one actually trying to achieve something here

#

@glass sphinx lustri is complaining that you did not do cluster area estimation yet

glass sphinx
#

im too lazy

wicked notch
#

I am not

wispy spear
#

put your heads together

#

get it done

glass sphinx
#

i only touch opengl with a stick

wispy spear
#

🥢

#

here have 2

glass sphinx
#

we can plan it together

wicked notch
#

Worry not, I am currently using our lord and savior vulkan

glass sphinx
#

goooood

#

so

#

did you ever want to join a cult before?

#

i have good news

#

we are recruiting

wicked notch
#

Incredible

#

What kind of cult

glass sphinx
#

first you need to sign a clause that makes you my property forever

#

i am gathering the convincement crew

delicate rain
#

It's a good and friendly cult

#

you only need to recruit 5 more ppl as a payoff

#

for getting introduced

#

it's worth it tho

delicate rain
#

which you seem you are

glass sphinx
#

are you a c++ person @wicked notch ?

left jacinth
#

Did somebody say Daxa?

wicked notch
#

They're raiding me nervous

glass sphinx
#

you have no escape

wicked notch
#

deccer look what you did

glass sphinx
#

we come and conquer

wicked notch
#

Anyways, I'm honored about this invitation, but I'm afraid I can't join your cult 😦

glass sphinx
#

👹

glass sphinx
#

anyways we need more daxa people to compensate my lazyness get even more features in

#

if you ever feel the need to completely rewrite everything for no reason in daxa we are eating you alive always here

wicked notch
#

Sure, I will eventually get tired of writing stuff on my own bleakekw

glass sphinx
#

merge it into daxa

delicate rain
#

Suffering is meant to be shared

glass sphinx
#

real

#

real

delicate rain
#

especially when it's caused by GP

glass sphinx
#

btw is there a way to see the post quickly?

#

it seems like i must scroll up throu 100 quadrillion messages

wicked notch
#

Ah yeah, unfortunately I did not have the foresight to pin the initial thing bleakekw

#

But worry not, I just recently started with Vulkan

#

I went straight for mesh shaders (the EXT one)

glass sphinx
#

template addict very good

left jacinth
#

Holy shit Patrick relax on the cult behavior

glass sphinx
#

how do they miss obvious features like that i cant see the original post

wispy spear
#

before you fine daxa people try to sell more carpet to my italic friend here, help him figure out cluster-mesh-thingy

#

then he's all yours 😄

wicked notch
#

Look at him, abandoning me like this

wispy spear
#

i get a commission, its worf it

left jacinth
#

Lmao

glass sphinx
#

crimes

left jacinth
#

??

#

What the balls are you saying saky

glass sphinx
#

@wicked notch can you link and pin the github?

wicked notch
#

Sure

#

One sec

frank sail
#

discord is very cool

wicked notch
#

I suppose you won't be much interested in the GL one

glass sphinx
#

soon daxa garbage pure gold

wispy spear
#

its not garbage, its brainworm material tbf

left jacinth
#

Daxa is not garbage

#

Daxa is literally the best vulkan abstraction possible

#

Except it's missing a couple things 💀

wispy spear
#

no sales pitches we need solutions

glass sphinx
#

the code looks quite clean

#

better than other things i have seen here

glass sphinx
#

i like your code

wicked notch
#

Thanks, it's missing a lot of niceties though

delicate rain
#

The code looks awesome

wicked notch
#

A proper render graph for example bleakekw

glass sphinx
#

damn this code is super clean

delicate rain
#

I mean, thats a pretty tall task for a nicety 😄

glass sphinx
#

task list burned oput so much of by brain

#

since task list i cant type properly anymore

wicked notch
#

The worm needs more brain mass

glass sphinx
#

😿

#

stypes to be pain

delicate rain
#

we had pain bugs caused by omitting them by accident in some places

wicked notch
#

As expected, ROC does not perform very well

#

I'm conflicted

wicked notch
#

I have ordered 128GB of RAM

#

finally we can fit Moana Island in System Ram KEKW

wispy spear
#

😄

wicked notch
#

I don't understand how Darianopolis managed to export FBX in unreal tbh

#

I'm trying it with Moana assets and it exports a botched version, the lowest LOD possible KEKW

wispy spear
#

summon him and start the interlocution

glass sphinx
#

THERE ARE FOUUUUURR LIGHTS

wispy spear
#

at last

glass sphinx
#

TNG is really good

glass sphinx
#

@wispy spear pro tip: of the new Star Treks (which are all poopy), the newest one (Strange New Worlds) is really good

#

big recommendation

#

Pike is wonderful

wispy spear
#

@glass sphinx you're late

#

already watched all of it

#

and yes, you're right

#

what I find pretty crap is Spock and the nurse, I thought his wife was way cooler

#

and the Noonien-Singh lady also gets annoying every time with her Gorn shit

#

shame the Andorian engine-room guy is gone, he was the best

wispy spear
#

das a jacky chan movie

#

we're just lamenting about the latest star trek isms, where strange new worlds is the better out of the 3 new shows

wicked notch
#

Star trek things aside

#

I think I have the best possible implementation of HiZ my brain can manage

#

And it is reasonably conservative

frank sail
#

aliasing aaaaaaah

#

jk nice cooling

wicked notch
#

FSR2 will come soon™️

#

Also is it just me or are these normals fucked

#

This is the first time I've seen this nervous

frank sail
#

hm

glass sphinx
#

super sad

#

but it had very good episodes

frank sail
#

if you don't remap your normals, half of them will be black

wicked notch
#

Yeah, I meant the green being -X

frank sail
#

o

wicked notch
#

On the little thingy in the middle

frank sail
#

da green been

#

hehe

glass sphinx
#

n * 0.5 + 0.5
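
That remap as a tiny sketch (`Vec3` and the function names are placeholders): unit normals have components in `[-1, 1]`, and a colour output clamps negatives to black, so each component is rescaled into `[0, 1]` for display.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Maps each component from [-1, 1] to [0, 1] so negative axes stay visible.
Vec3 normal_to_color(const Vec3& n) {
    return Vec3{n.x * 0.5f + 0.5f, n.y * 0.5f + 0.5f, n.z * 0.5f + 0.5f};
}

// Inverse mapping, for reading the normal back out of a colour.
Vec3 color_to_normal(const Vec3& c) {
    return Vec3{c.x * 2.0f - 1.0f, c.y * 2.0f - 1.0f, c.z * 2.0f - 1.0f};
}
```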

wicked notch
#

ye this do not be good

glass sphinx
#

huuh

wicked notch
#

So uh

#

Area of a triangle is bh/2

#

What's the area of a bunch of em

frank sail
#

bh/2 * numberOfTris
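
Taken literally, that only holds when every triangle has the same base and height. A sketch of the general case, with placeholder types: sum the per-triangle areas, each being half the length of the cross product of two edges.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
struct Triangle { Vec3 a, b, c; };

Vec3 sub(const Vec3& l, const Vec3& r) { return Vec3{l.x - r.x, l.y - r.y, l.z - r.z}; }

Vec3 cross(const Vec3& l, const Vec3& r) {
    return Vec3{l.y * r.z - l.z * r.y, l.z * r.x - l.x * r.z, l.x * r.y - l.y * r.x};
}

// Half the magnitude of the cross product of two edges.
double triangle_area(const Triangle& t) {
    const Vec3 n = cross(sub(t.b, t.a), sub(t.c, t.a));
    return 0.5 * std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
}

// Surface area of a cluster: the sum over its triangles.
double total_area(const std::vector<Triangle>& triangles) {
    double sum = 0.0;
    for (const Triangle& t : triangles) sum += triangle_area(t);
    return sum;
}
```

This is object-space surface area; the screen-space heuristic discussed next is a different quantity.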

wicked notch
#

Course it is

frank sail
#

The formula to calculate the area of a regular polygon is, Area = (number of sides × length of one side × apothem)/2, where the value of apothem can be calculated using the formula, Apothem = [(length of one side)/{2 ×(tan(180/number of sides))}].

wicked notch
#

Das a lot of data

#

Well not a lot

#

But hm

#

I wonder how Unreal does it, I was thinking of computing the area of a clip space projected AABB around the cluster

#

So like, (box.z - box.x) * (box.w - box.y)

#

But this is extremely skewed towards hardware rendering

#

And hardware rendering is cringe

#

I should also determine clusters whose triangles are going to be clipped

#

Given that we do things per vertex, I could check if vertex.xy > 1.0 || vertex.xy < -1.0

#

And mark the cluster as hardware rasterizeable if so
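
A sketch of the idea described in these messages, with invented names: take clip-space points (cluster vertices or AABB corners), divide by w, track the NDC bounding box, and flag the cluster as clipped when any point leaves the `[-1, 1]` square. Treating `w <= 0` as clipped is an extra assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec4 { double x, y, z, w; };

struct ScreenBounds {
    double min_x, min_y, max_x, max_y;  // NDC box, same order as (box.x, box.y, box.z, box.w)
    bool clipped;                       // true when any point leaves the [-1, 1] square
};

ScreenBounds project_bounds(const std::vector<Vec4>& clip_points) {
    assert(!clip_points.empty());
    ScreenBounds b{1e30, 1e30, -1e30, -1e30, false};
    for (const Vec4& v : clip_points) {
        if (v.w <= 0.0) {  // behind the camera: the divide is meaningless
            b.clipped = true;
            continue;
        }
        const double x = v.x / v.w;
        const double y = v.y / v.w;
        if (x < -1.0 || x > 1.0 || y < -1.0 || y > 1.0) b.clipped = true;
        b.min_x = std::min(b.min_x, x);
        b.min_y = std::min(b.min_y, y);
        b.max_x = std::max(b.max_x, x);
        b.max_y = std::max(b.max_y, y);
    }
    return b;
}

// (box.z - box.x) * (box.w - box.y); zero when no point was in front of the camera.
double ndc_area(const ScreenBounds& b) {
    if (b.max_x < b.min_x || b.max_y < b.min_y) return 0.0;
    return (b.max_x - b.min_x) * (b.max_y - b.min_y);
}
```

The area is in NDC units (the full screen is 4.0); multiplying by a quarter of the pixel count of the render target would turn it into pixels.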

#

Cutting the area of the projected AABB in half could be good

#

Depending on uh, literally everything

#

Jaker, could you lend me your braincell

frank sail
#

I can lend a froge a braincell

#

ok so you're trying to see how big a cluster is in screen space as a heuristic to determine which rasterizer to use?

wicked notch
#

yes

frank sail
#

did you already give up on getting the bounding box

wicked notch
#

Later on to determine which lod to use, but that's a story for future me

#

I have not

#

It's the only way I could think of bleakekw

#

Of course I will try it, I would like to hear other smorter/dumbere ways

wispy spear
#

that's cool btw, just saw the motion picture in all normal glory

wicked notch
#

I still don't have textures btw bleakekw

wispy spear
#

soon(tm)

wicked notch
#

I won't introduce a memory bottleneck immediately so that I can still see gains from my ridiculous quest towards Nanite

#

Or well, a scuffed, dumber and worse version of Nanite KEKW

frank sail
#

btw there is a way to get the screen space area of a projected sphere

wicked notch
#

Inigo Quilez to the rescue pog

frank sail
#

I think it might be more ideal than the screen space AABB in some instances

#

idk if it's better in general

wicked notch
#

L2 broke through its own limits

glass sphinx
#

profilers do be sniffing glue sometimes

wicked notch
#

At least clip detection is working (no white pixels at the edges)

#

Alright we did it boys

#

Blue = Small Area = Software Raster
Red = Big Area = Hardware Raster
Black = Clipped = Hardware Raster
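
The legend above amounts to a small decision function. A sketch with a hypothetical threshold parameter (the actual cut-off used in the project is not stated in this log):

```cpp
#include <cassert>

enum class ClusterClass { SoftwareRaster, HardwareRaster };

// Clipped clusters always go to the hardware rasterizer; otherwise the
// screen-space area decides. `max_software_area` is a made-up tuning knob.
ClusterClass classify_cluster(double screen_area, bool clipped, double max_software_area) {
    if (clipped) return ClusterClass::HardwareRaster;
    return screen_area <= max_software_area ? ClusterClass::SoftwareRaster
                                            : ClusterClass::HardwareRaster;
}
```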

#

Everything converges to blue as distance grows as expected

wicked notch
#

Now I gotta do soft rast nervous

wispy spear
#

damn this thing is getting better and better ❤️

summer gyro
#

Is this still opengl ?
Or have you wandered off to vulkan ?

#

This stuff looks soo cool

wicked notch
#

I have unfortunately defected to Vulkan 😦

#

But I still have good uses for GL

summer gyro
#

Fair enough

#

I also recently started learning vulkan
And the validation layers are so much better than the debug call backs

glass sphinx
#

aha uhu

#

tell me

wicked notch
#

Well uh

#

I have IrisGL that works pretty well, it has shadows and all, so I can use that thing as a reference sometimes

#

Don't slander my boy GL 😭

wispy spear
#

oh?

#

this is vookan already?

wicked notch
#

Yes, the last updates have been in Vulkan

glass sphinx
#

LVSTRI one of the people i would hire if i could

#

workoholics that write pretty code are very good for business

wicked notch
#

By the way I managed to export a cute scene from UE5

#

It's pretty nice

wispy spear
#

wouldnt surprise me if lustri interops some dx12 into the mix for some obscure reason heh

wicked notch
#

I did not go that far into madness yet bleakekw

#

It's fully Vulkan

wicked notch
#

I think it's finally time to add textures

wicked notch
#

RAM has arrived let's goo

#

I shall use every last friggin byte

wicked notch
#

Man, 128GB of RAM feels truly liberating

#

Unreal uses up to 64 and I still have 64 left KEKW

wicked notch
#

with a super long delay, textures (meshlet flavor)

raven orchid
#

Hey nice!

#

What causes it to need so much system ram? Are you having to stream from system to vram a lot or does it mostly stay in system?

wicked notch
#

Not yet, I got so much RAM because of Unreal Engine (which I use as editor) and blender nervous

#

They were taking up so much RAM I couldn't bear it anymore

proven laurel
#

converting Bistro from FBX to GLTF was my breaking point to go from 16GB to 32GB KEKW

finite quartz
#

I went from 32 to 64 because I was often swapping because of Painter... froghorror (We like RAM and VRAM a lot)

proven laurel
#

3D software just eats up all ram and vram lol

finite quartz
#

we eat all the vram we can :p

proven laurel
#

well I'll see if my GPU has enough for the stuff I want to do KEKW

wicked notch
#

Hmm some clusters have triangles scattered about all over the place

#

The hell is meshoptimizer doing

wicked notch
#

It is now time

#

compute rasterizer

#

except I am sleep

#

So that is deferred to tomorrow KEKW

glass sphinx
#

btw what is your educational status?

#

hs? uni? working?

frank sail
#

uni

#

(I am observing lvstri through walls)

wicked notch
#

Jaker do be livin in my walls

wicked notch
#

bistro if you draw only big triangles

wicked notch
#

Hmm

#

How should I schedule work for my compute rasterizer

#

Perhaps one workgroup per meshlet and one thread per primitive?

#

It's gonna have a lot of dead invocations...

#

primitive_count is never going to be MAX_PRIMITIVES

#

More indirection could solve this though bleakekw

wicked notch
#

World's most absolutely unhinged and stupid software rasterizer bleakekw

#

polygon fill: todo
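
For reference, a CPU sketch of what the missing fill step could look like, using the common edge-function approach over the triangle's bounding box. This is a generic technique, not the author's eventual implementation, and every name in it is invented. A GPU version would replace the `std::max` write with `imageAtomicMax`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Vec2 { double x, y; };

// Signed area of the parallelogram (a->b, a->p); the sign tells which side of edge ab p is on.
double edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Writes max(existing, payload) into every pixel whose centre is inside the triangle.
// `pixels` is row-major, width * height. Returns the number of covered pixels.
int fill_triangle(std::vector<std::uint64_t>& pixels, int width, int height,
                  Vec2 v0, Vec2 v1, Vec2 v2, std::uint64_t payload) {
    assert(pixels.size() == static_cast<std::size_t>(width) * height);
    const double area = edge(v0, v1, v2);
    if (area == 0.0) return 0;           // degenerate triangle
    if (area < 0.0) std::swap(v1, v2);   // make the winding consistent

    // Clamp the triangle's bounding box to the framebuffer.
    const int min_x = std::max(0, static_cast<int>(std::floor(std::min({v0.x, v1.x, v2.x}))));
    const int min_y = std::max(0, static_cast<int>(std::floor(std::min({v0.y, v1.y, v2.y}))));
    const int max_x = std::min(width - 1, static_cast<int>(std::ceil(std::max({v0.x, v1.x, v2.x}))));
    const int max_y = std::min(height - 1, static_cast<int>(std::ceil(std::max({v0.y, v1.y, v2.y}))));

    int covered = 0;
    for (int y = min_y; y <= max_y; ++y) {
        for (int x = min_x; x <= max_x; ++x) {
            const Vec2 p{x + 0.5, y + 0.5};  // sample at the pixel centre
            if (edge(v0, v1, p) >= 0.0 && edge(v1, v2, p) >= 0.0 && edge(v2, v0, p) >= 0.0) {
                std::uint64_t& texel = pixels[static_cast<std::size_t>(y) * width + x];
                texel = std::max(texel, payload);
                ++covered;
            }
        }
    }
    return covered;
}
```

The sketch has no top-left fill rule, so pixels exactly on a shared edge could be written by both neighbouring triangles; with a max-based visibility buffer that is harmless but wasteful.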

#

Hmm this isn't very promising

#

Granted this is probably the most inefficient way possible of doing a compute rasterizer

#

But right now it's also not doing much...

#

Perhaps I am spending a lot of time idle

#

Uhhhh

#

How the hell do I read this?

#

Oh well, I can see that the only two if statements are taking up a combined 120% of the time spent in the shader bleakekw

wicked notch
#

I think the added indirection is necessary

#

I am spending over 300 microseconds idle

wispy spear
#

is there a way to get rid of that if by somehow sorting the data beforehand?

wicked notch
#

Yeah

#

I need to do an extra processing step

#

I should make a buffer with all software rasterized meshlets and another with an index to all software rasterized primitives

#

Actually no

#

it's probably better if I make a single buffer with primitiveID | meshletID << 7

#

WG size will be 256

#

And we round down as usual
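
A sketch of that work list, assuming 32-bit entries with 7 bits of primitive ID (which leaves 25 bits for the meshlet ID) and invented helper names. The dispatch helper rounds the workgroup count up so a partial last workgroup still launches; that is this sketch's own choice and may not be what the message meant.

```cpp
#include <cassert>
#include <cstdint>

const std::uint32_t kPrimitiveBits = 7;  // up to 128 primitives per meshlet

// primitiveID | meshletID << 7
std::uint32_t pack_work_item(std::uint32_t meshlet_id, std::uint32_t primitive_id) {
    assert(primitive_id < (1u << kPrimitiveBits));
    assert(meshlet_id < (1u << (32 - kPrimitiveBits)));  // 25 bits left in a 32-bit word
    return primitive_id | (meshlet_id << kPrimitiveBits);
}

std::uint32_t unpack_meshlet_id(std::uint32_t item) { return item >> kPrimitiveBits; }

std::uint32_t unpack_primitive_id(std::uint32_t item) {
    return item & ((1u << kPrimitiveBits) - 1u);
}

// Number of workgroups needed so every work item gets a thread.
std::uint32_t workgroup_count(std::uint32_t item_count, std::uint32_t workgroup_size) {
    assert(workgroup_size > 0);
    return (item_count + workgroup_size - 1) / workgroup_size;
}
```

With one entry per primitive, every launched thread has real work except in the final workgroup, where threads past `item_count` return early.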

#

looping over all primitives vs looping over all meshlets hmmmmmm

#

deep ponderation

#

Looks like nanite does the former

#

And it is somehow faster

#

Even deeper pondering

wicked notch
#

I am dispatching 476394 workgroups, each workgroup has a local size of 128

#

The average number of primitives is about 64

#

Which means that of 60978432 threads, 30489216 are doing nothing bleakekw
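
The arithmetic behind those numbers, as a small sketch (struct and function names are made up): launched threads are workgroups times local size, busy threads are workgroups times the average primitive count.

```cpp
#include <cassert>
#include <cstdint>

struct ThreadUsage {
    std::uint64_t launched;
    std::uint64_t busy;
    std::uint64_t idle;
};

// One workgroup per meshlet, one thread per primitive: every thread beyond the
// meshlet's primitive count has nothing to do.
ThreadUsage estimate_thread_usage(std::uint64_t workgroups, std::uint64_t local_size,
                                  std::uint64_t average_primitives) {
    assert(average_primitives <= local_size);
    const std::uint64_t launched = workgroups * local_size;
    const std::uint64_t busy = workgroups * average_primitives;
    return ThreadUsage{launched, busy, launched - busy};
}
```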

wispy spear
#

fouf, sounds like a lot wasted potential

proven laurel
#

with meshlets you generally want to try to merge as much as possible

#

otherwise you get low occupancy

wicked notch
#

Perhaps making smaller meshlets is better

#

As a quick sanity check I tried making the meshlets smaller, but the time taken by the rasterizer is the same....

wicked notch
#

Any change I make impacts minimally, looks like idle threads are not the bottleneck?

#

I strongly believe I am doing something wrong, I will inspect more closely

#

The barriers and loads are taking up most of the time, hmm

wicked notch
#

@frank sail could you help a smol-brained frog in need?

frank sail
#

no

#

what are the throughputs

wicked notch
#

This is with hardware raster (consider only meshlet_cull_and_draw)

frank sail
#

what part of the frame am I looking at

wicked notch
#

meshlet_cull_and_draw (I have disabled culling btw, for testing hw/sw)

frank sail
#

hmm

#

what's the occupancy

wicked notch
#

This is software

wicked notch
#

Thread coherency is also 99%

#

The software rasterizer also doesn't do primitive filling, it just renders points

#

I could show you the code if you are interested

#

But it's really basic

#
```glsl
shared meshlet_data_t meshlet;
shared vec3[64] vertices;

void main() {
    const uint meshlet_instance_id = gl_WorkGroupID.x;
    const uint meshlet_id = meshlet_instances[meshlet_instance_id].meshlet_id;
    const uint instance_id = meshlet_instances[meshlet_instance_id].instance_id;

    if (is_candidate_sw_raster(meshlet_id) && gl_LocalInvocationID.x == 0) {
        meshlet = meshlet_data[meshlet_id];
    }
    barrier();

    if (!is_candidate_sw_raster(meshlet_id)) {
        return;
    }

    const mat4 transform = instances[instance_id];
    if (gl_LocalInvocationID.x < meshlet.index_count) {
        vertices[gl_LocalInvocationID.x] = fetch_vertex_and_project_to_ndc(gl_LocalInvocationID.x);
    }
    barrier();

    const uint primitive_id = gl_LocalInvocationID.x;
    if (primitive_id < meshlet.primitive_count) {
        const vec3[] triangles = rasterize_triangles(primitive_id);
        imageAtomicMax(visbuffer, ivec2(triangles[0].xy), triangles[0].z);
        imageAtomicMax(visbuffer, ivec2(triangles[1].xy), triangles[1].z);
        imageAtomicMax(visbuffer, ivec2(triangles[2].xy), triangles[2].z);
    }
}
```
#

This is the gist

#

The things that take the most are the two barriers (and I have no idea why)

frank sail
#

remove the barriers bleakekw

wicked notch
#

wg size is (128, 1, 1) and dispatch size is (meshlet_count, 1, 1) btw

#

Or one workgroup per meshlet, one thread per primitive

#

It is almost a 1 to 1 copy of what unreal does bleakekw

frank sail
#

hmm I guess you can't really change the wg size

#

but yeah barrier with a big wg will be slow

wicked notch
#

If I make the WG smaller I get 3x worse perf

frank sail
#

ouphe

wispy spear
#

make wg bigger then anismart

wicked notch
#

2x worse perf

frank sail
#

make it samer

wispy spear
#

heh

#

splitting this into multiple passes would not help?

wicked notch
#

Is this barrier really that destructive?

```glsl
if (gl_LocalInvocationID.x == 0 && is_sw_rast) {
    meshlet = meshlet_ptr.data[meshlet_id];
}
barrier();
```
wispy spear
#

ah

wicked notch
#

It is copying something like, 64 bytes of data in shared memory

#

How is this barrier taking half the time spent in the shader

frank sail
#

well it's possible it's just an artifact of how the profiler reports things

#

in RGP, actual load instructions will appear very cheap, but then their cost will show up at s_waitcnt instructions

wicked notch
#

Right

#

except there are literal noops before this barrier

#

except the load

frank sail
#

just one teensy weensy little load

wicked notch
#

Like legit it's just this

void main() {
    const uint meshlet_instance_id = gl_WorkGroupID.x;

    const uint meshlet_id = instance_ptr.data[meshlet_instance_id].meshlet_id;
    const uint instance_id = instance_ptr.data[meshlet_instance_id].instance_id;
    const uint primitive_id = gl_LocalInvocationID.x;
    const bool is_sw_rast = cluster_class_ptr.data[meshlet_instance_id] == CLUSTER_CLASS_SW_RASTER;
    
    if (gl_LocalInvocationID.x == 0 && is_sw_rast) {
        meshlet = meshlet_ptr.data[meshlet_id];
    }
    barrier();
}
#

Oh wait there is another load

#

the classification

#

Let me try forcing SW

#

It's the same

#

I am dying

#

Alright

#

I will rasterize NOTHING

#

?????????????????????

#

16 milliseconds for doing nothing

#

What the hell is happening

frank sail
#

are you dispatching a lot of wgs or something

wicked notch
#

uh

#

maybe

#

is 500'000 considered a lot

#

btw I cracked it

#

it was the imageAtomicMax

#

I have to thank the profiler for misleading me and doing absolutely nothing to help me figure it out KEKW

#

Now it's taking the same as the hardware rasterizer

#

Which is still terrible

#

it should be taking 3x less time than HW raster in this particular case (according to unreal)

glass sphinx
#

well

wicked notch
#

Was I wrong to expect a crazy speed up maybe?

#

Ah beautiful

#

On today's episode of: things that make no sense

#

Thai scene: 30 million primitives so many small triangles, compute matches raster (compute should be faster)
Bistro scene: 5 million primitives and many big triangles, compute is faster than raster

#

heh

glass sphinx
#

maybe your heuristic to check for median tri size is off

wicked notch
#

I am doing full software vs full raster right now

glass sphinx
#

ah ok

distant lodge
#

I'm surprised that it's possible to match/beat raster hw at all

frank sail
#

its perf breaks down when you have really bad quad occupancy

distant lodge
#

man I should've started following earlier

frank sail
#

with thin or tiny tris

distant lodge
#

is one of the drawbacks that you need to store a buffer of all triangles ever instanced in your scene? I can't imagine you can beat the hardware vertex cache

frank sail
#

idk how much of a perf uplift hw vertex reuse is, but clearly it's not unbeatable

#

you also lose a lot of potential vertex reuse when you render unconnected meshlets

distant lodge
#

true

#

I wonder what the total memory requirement to render bistro is (minus images)

frank sail
#

btw, I wonder how much vertex reuse you can get with shared memory

#

oh wait

#

I think lvstri is already loading verts to shared mem

distant lodge
#

oh I think I get it, because you're working with meshlets you have a known bound on the triangles you're working on

#

so your shared mem gets loaded with your meshlet's vertices and you go from there

frank sail
#

I guess you could do something like this

layout(local_size_x = 128) in; // max number of verts in a meshlet

fetch and shade vertex[localInvocationID]
store transformed vertex in shared memory
barrier()
if (localInvocationID < numPrimsInThisMeshlet)
  assemble primitive[localInvocationID]
  rasterize primitive
distant lodge
#

yeah, rasterizing 1 primitive per thread sounds kinda funky though

#

is that what the hardware does

frank sail
#

well, there is dedicated hw for rasterizing prims, so it's super fast

#

but in this case, your prims are only a few pixels large at most

distant lodge
#

though I guess since we're talking small triangles specifically that this is used for

#

yeah

#

software rasterization in compute always seemed pretty cumbersome to me because both the vertex and fragment operations essentially decompress into a ton more data to process in the next stage

#

but it makes sense how this technique deals with it

frank sail
#

it works well in very constrained situations

#

in nanite, they're working with tiny triangles in a visbuffer-like renderer (so the fragment shader is literally just writing depth and a triangle/instance ID)

distant lodge
#

yeah

#

deceptively simple and insanely clever

frank sail
#

and best of all, infinitely bikesheddable

glass sphinx
#

you can do the same tricks as meshshaders do

#

you shade all verts in a meshlet within a workgroup

#

then share the results and create triangles

#

then rasterize

frank sail
#

my frogge

glass sphinx
#

that can and will beat the hw vertex cache hard

glass sphinx
#

the only strong usecase with software raster is to write a visbuffer with depth

cedar seal
#

frog shading

wicked notch
#

I think I cracked the code

#

Perhaps the reason my compute raster was so garbage, was due to unconditionally imageAtomicMax'ing

#

I should've just done the classic

for (uint y = min.y; y < max.y; ++y) {
    for (uint x = min.x; x < max.x; ++x) {
        if (is_inside_triangle(x, y, ...)) {
            imageAtomicMax(visbuffer, ...);
        }
    }
}```
raven orchid
#

I’m guessing when you switched to conditional atomic the level of contention went way down?

wicked notch
#

I'm still testing right now, results will be in soon™️

wicked notch
#

Alright results are in

#

And what sad results these are: occupancy remained the same. After all, rasterizing pixel-sized triangles is quite easy

#

As did the time taken for Thai (2.83)

#

I am going to assume something is fundamentally wrong with the way I build this software rasterizer

#

Until I find someone to pester about this, software raster is on hold 😦

distant lodge
#

rip, what algorithm do you use to actually rasterize btw

wicked notch
#

Perhaps I should try Unreal's algorithm as well

distant lodge
#

which algorithm? scanline? checking if a pixel is in the triangle in an AABB?

wicked notch
#

For each pixel in the triangle bounds yes
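
#

For reference, the "test every pixel in the triangle's bounding box" approach can be sketched on the CPU like this. It is C++ rather than GLSL so it can be compiled and run standalone, and it is not the shader from this thread: `Vec2`, `edge` and `rasterize_aabb` are made-up names, and sampling at pixel centers with inclusive bounds is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };  // hypothetical helper type

// Signed area of the parallelogram spanned by (b - a) and (p - a).
// The sign tells which side of the edge a->b the point p lies on.
inline float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Returns the pixels (as y * width + x) whose centers the triangle covers.
inline std::vector<uint32_t> rasterize_aabb(Vec2 v0, Vec2 v1, Vec2 v2,
                                            int width, int height) {
    std::vector<uint32_t> covered;
    const float area = edge(v0, v1, v2);
    if (area == 0.0f) return covered;  // degenerate triangle, nothing to draw

    // Bounding box of the triangle, clamped to the viewport.
    const int min_x = std::max(0, (int)std::floor(std::min({v0.x, v1.x, v2.x})));
    const int min_y = std::max(0, (int)std::floor(std::min({v0.y, v1.y, v2.y})));
    const int max_x = std::min(width - 1, (int)std::floor(std::max({v0.x, v1.x, v2.x})));
    const int max_y = std::min(height - 1, (int)std::floor(std::max({v0.y, v1.y, v2.y})));

    for (int y = min_y; y <= max_y; ++y) {
        for (int x = min_x; x <= max_x; ++x) {
            const Vec2 p{x + 0.5f, y + 0.5f};  // assumption: sample at the pixel center
            // Dividing by the signed area makes the test independent of winding.
            const float w0 = edge(v1, v2, p) / area;
            const float w1 = edge(v2, v0, p) / area;
            const float w2 = edge(v0, v1, p) / area;
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;  // outside the triangle
            covered.push_back(uint32_t(y) * uint32_t(width) + uint32_t(x));
        }
    }
    return covered;
}
```

In a shader the `push_back` would be the atomic write to the visbuffer, so only covered pixels ever issue an atomic.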

distant lodge
#

and you're rendering 1 triangle/thread?

wicked notch
#

Yeah it is quite bad, but somehow still manages good occupancy

#

Yes one prim per thread

#

I'll try unreal's approach

distant lodge
#

what's unreal's approach?

wicked notch
#

They have a hybrid scanline/pixel in AABB method

#

They choose one based on triangle screen footprint

distant lodge
#

it 404s

#

what's the secret to read UE code

wicked notch
#

Let me find the sacred link once again

distant lodge
#

oh gross I need a UE account and need to connect it

wicked notch
#

Yes, very sad

wicked notch
#

I was thinking about the way I classify meshlet area

#

Could I compute the perfect area of a cluster in local space at load time and then scale that based on view distance and the transform's scale? thonk

#

Well it doesn't really matter right now, gotta solve sw raster first, I'll give it a few tries more and then move on

wispy spear
#

does that not depend on how (as in where) you look at the mesh, which you cant possibly know at load time?

wicked notch
#

Yes, at load time you compute a "baseline", the true area of the cluster in local space

#

Then, the idea is to scale that area based on view distance and transform's scale
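
#

Roughly what that scaling could look like, as a C++ sketch. Everything here is an assumption for illustration: a uniform transform scale, a perspective projection with a vertical FOV, and the function name `estimate_screen_area`. It ignores orientation entirely, so it only describes a cluster facing the camera head-on.

```cpp
#include <cmath>

// Estimates the on-screen area (in pixels^2) of a cluster from its
// precomputed local-space area. Assumes the cluster faces the camera.
inline float estimate_screen_area(float local_area, float uniform_scale,
                                  float view_distance, float fov_y_radians,
                                  float viewport_height) {
    // Focal length expressed in pixels for a perspective projection.
    const float focal_px = 0.5f * viewport_height / std::tan(0.5f * fov_y_radians);
    // How many pixels one local-space unit covers at this distance.
    const float pixels_per_unit = focal_px * uniform_scale / view_distance;
    // Area scales with the square of the linear scale factor.
    return local_area * pixels_per_unit * pixels_per_unit;
}
```

Doubling the distance quarters the estimate and doubling the scale quadruples it, which is the behaviour the idea relies on.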

#

Hmm perhaps this does not work with disconnected clusters (i.e. triangles that are not connected but share the same cluster ID)

distant lodge
#

aren't you generally trying to avoid having those though

wicked notch
#

Yes but meshoptimizer can't help but make disconnected clusters sometimes

#

I might make my own meshletizer

#

Or hack into meshoptimizer and fix that "bug"

wispy spear
wicked notch
#

I have reached a conclusion

#

Actually two conclusions

#

Conclusion #1: I was indeed doing fundamentally flawed calculations

#

Conclusion #2:

void rasterize(in vec3[3] triangle, in uint64_t payload) {
    const vec4 bounds = make_bounding_box(triangle[0], triangle[1], triangle[2]);
    const uint start_x = uint(bounds.x);
    const uint start_y = uint(bounds.y);
    const uint end_x = uint(bounds.z);
    const uint end_y = uint(bounds.w);
    for (uint x = start_x; x < end_x; ++x) {
        for (uint y = start_y; y < end_y; ++y) {
            const vec3 barycentric = make_barycentric(triangle[0], triangle[1], triangle[2], vec2(x, y));
            if (barycentric.x < 0.0 || barycentric.y < 0.0 || barycentric.z < 0.0) {
                continue;
            }
            const float z = dot(barycentric, vec3(triangle[0].z, triangle[1].z, triangle[2].z));
            imageAtomicMax(u_visbuffer, ivec2(x, y), (uint64_t(floatBitsToUint(z)) << 34) | payload);
        }
    }
}
``` This is pure and utter garbage bleakekw
#

it doesn't respect any rasterization spec ever created
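
#

The one part of that function that is the usual visbuffer trick is the packed atomic max: depth goes in the high bits and the ID payload in the low bits, so one 64-bit max does the depth test and the write together. A CPU-side C++ sketch of the idea follows. It assumes a simple 32/32 split rather than the shift of 34 used above, non-negative depth, and "larger depth wins"; `pack_visbuffer`, `atomic_max` and `unpack_payload` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Depth in the high 32 bits, payload in the low 32 bits. For non-negative
// floats the raw bit pattern orders the same way as the value, so comparing
// packed integers compares depth first.
inline uint64_t pack_visbuffer(float depth, uint32_t payload) {
    uint32_t depth_bits;
    std::memcpy(&depth_bits, &depth, sizeof(depth_bits));  // like floatBitsToUint
    return (uint64_t(depth_bits) << 32) | uint64_t(payload);
}

// CPU stand-in for imageAtomicMax: keep the larger of stored and new value.
inline void atomic_max(std::atomic<uint64_t>& texel, uint64_t value) {
    uint64_t current = texel.load(std::memory_order_relaxed);
    while (current < value &&
           !texel.compare_exchange_weak(current, value, std::memory_order_relaxed)) {
        // compare_exchange_weak refreshes `current` on failure, so just retry.
    }
}

inline uint32_t unpack_payload(uint64_t texel) {
    return uint32_t(texel & 0xFFFFFFFFu);
}
```

Whatever order the writes arrive in, the texel ends up holding the payload of the fragment with the largest depth.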

wicked notch
#

Am I overthinking this? What the hell is NaniteViewAndInvViewSize and NaniteViewRect

wicked notch
#

I am so sad

#

Unreal's rasterizer literal copy does nothing to help

glass sphinx
wicked notch
#

RTX 3070

glass sphinx
#

strange

wicked notch
#

Perhaps their clusterizer is that much better than Meshoptimizer?

wicked notch
#

HLSL peeps, what does select do exactly

#

It's scary how little HLSL is documented

wispy spear
#

hmm never seen select in hlsl before, only know it from socket nonsense

wicked notch
#

It's old behaviour for ?: apparently

#

Thing is, what ?: does isn't documented either for vector types KEKW

wispy spear
#

: (

#

sad times

minor root
#

Guessing purely from naming select is any(…)

#

Then if it’s different from ?: then that should be all

frank sail
#

select sounds more like glsl's step

wicked notch
#

Regardless, something is, once again, fundamentally wrong

#

Mesh shaders are nice but they lack flexibility in choosing a meshlet's size

#

On NV it's either 64/126 or death

wispy spear
#

write about it, perhaps it tickles some $GPUVENDOR engineer's interest

wicked notch
#

I don't think they will change their schtuff because I can't make a compute rasterizer efficient bleakekw

wispy spear
#

heh

wicked notch
#

There's nothing wrong with 64/126 per se, it's the workgroup size mismatch that kills me

#

And NV likes 128 a lot more (for compute)

#

Task/Mesh is 32 only

#

But regardless, occupancy is fine, I'm always and forever limited by VRAM

frank sail
#

too bad you're in uncharted territory with this stuff

wispy spear
#

you could ask peeps to run your stuff on different hardware, if that helps

#

to collect some data

wicked notch
#

AMD hardware likes completely different things from NVIDIA's bleakekw

frank sail
#

package telemetry with your app to collect extra data

wicked notch
#

NVIDIA likes a WG of 32 for task/mesh, AMD likes one vertex/primitive per invocation

frank sail
wicked notch
#

Yeah I might

#

I don't have AMD hardware though bleakekw

#

Hey uh, Jaker

wispy spear
#

or compile 2 binaries

wicked notch
#

Could you send a little treat

minor root
frank sail
#

I send you my regards

wicked notch
#

A 7900xtx will suffice

minor root
#

A small offering for salvation

#

Inshallah lvstri will revolutionise rasterization

wicked notch
#

I am merely copying Unreal

minor root
#

Revolution!

wispy spear
#

people who wrote those shaders for unreal are probably in the copyright notice or commit log

#

mayhaps reach out

wicked notch
#

I could

#

worst they could do is send Hitmen to terminate me due to copyright violation

wispy spear
#

or have you hired

wicked notch
#

Jaker

#

you are at AMD

#

explain what primitive shaders are and how they are different from mesh shaders

#

it's your patent after all

frank sail
#

what have you googled

wicked notch
#
"The vast majority of triangles are software rasterised using hyper-optimised compute shaders specifically designed for the advantages we can exploit," explains Brian Karis. "As a result, we've been able to leave hardware rasterisers in the dust at this specific task. Software rasterisation is a core component of Nanite that allows it to achieve what it does. We can't beat hardware rasterisers in all cases though so we'll use hardware when we've determined it's the faster path. On PlayStation 5 we use primitive shaders for that path which is considerably faster than using the old pipeline we had before with vertex shaders."```
from: <https://www.eurogamer.net/digitalfoundry-2020-unreal-engine-5-playstation-5-tech-demo-analysis>
frank sail
#
wicked notch
#

epic

frank sail
wicked notch
#

I wanted to cope with: "Maybe my mesh shader is so good it doesn't need a software rasterizer"

#

Or something cringe like that bleakekw

#

But it turns out the PS5 does something different

#

Damn you AMD

frank sail
#

forget about ps5

#

nanite runs on AMD PC GPUs too

wicked notch
#

Yes but they use vertex shaders

#

That means their soft rast path is faster*
*in certain cases

frank sail
#

well I guess get a ps5 devkit then

#

or get into driver dev

wicked notch
#

I might need to acquire some AMD hw

#

But that would be one hell of a detour bleakekw

frank sail
#

de2our

wicked notch
#

Noice I skimmed through looks great

#

Bookmarked

#

Jaker how hard is driver dev

frank sail
#

idk, I don't do it

wicked notch
#

who does it

frank sail
#

@ pixelduck @ nanokatze @ mohamexiety @ martty @ pac85

wicked notch
#

Is driver dev on NV impossible?

frank sail
#

noyes

wicked notch
#

They don't share anything and the only open source driver sucks (I heard at least)

#

Hmm a 6950xt costs 600 robux

frank sail
wicked notch
#

nearest? you mean in my walls

frank sail
#

we're roommates (in your walls)

wispy spear
#

toomuchvoltage is at nvidia iirc 😉

frank sail
#

working on drivers though?

wispy spear
#

not sure

frank sail
#

or devtech stuff mayhap

wicked notch
#

Btw

#

I went back to our old friend GL

#

And the vertex path here matches compute raster and mesh shaders on Vulkan as well

#

I have never seen 3 ridiculously different techniques agree on performance so much

#

What the hell

#

I want to profile unreal

frank sail
#

unreal is instrumented

wicked notch
#

It is time to compile unreal from source, wish me luck

frank sail
#

with gpu frame marquers

wicked notch
frank sail
#

I'm also stating that as a fact

wicked notch
#

I should first see if they list sw/hw timings

frank sail
#

because I have to profile unreal every day at work bleakekw

wicked notch
#

epic

#

do you know how to get nanite sw/hw timings

#

Sparing me from the tedious documentation crunching bleakekw

frank sail
#

uh you put D3D12.EmitRgpFrameMarkers=1 in DefaultEngine.ini and then profile it with RGP :^)

wicked notch
#

"go buy AMD hardware scrub"

frank sail
#

ue has frame markers for other thingies

#

just gotta figure out how to enable them

wicked notch
#

How do you profile then

frank sail
#

wdym

wicked notch
#

Connect from RGP to Unreal somehow?

frank sail
#

you just hook up your favorite gpu profiler to the game you're profiling

wicked notch
#

Ah, do you have to export the game

frank sail
#

for rgp, you just need to have rdp (the program that hooks into vulkan, dx12, and opencl apps) running before you launch the app

frank sail
wicked notch
#

Oh that's great then

#

I don't want to wait 4 months for Unreal to package my stupid app with one mesh in it bleakekw

#

Last time I tried packaging my test thingy it took 1 hour

#

It had literally no actors beside a static mesh

#

Perhaps I did sumthing wrong

frank sail
#

probably

wicked notch
#

I have done a lot of investigations

#

Rendering all kinds of scenes, bistro subdivided into oblivion as well bleakekw

#

I subdivided everything into at least 100 million triangles

#

And I am starting to see some gains, it appears that THE WHOLE viewport has to be covered in pixel-sized triangles for the HW raster to be much slower than the SW raster

#

Perhaps Nanite really is only needed for stupid amounts of triangles

#

i.e: 1 billion+

glass sphinx
#

nanite only renders around 100 mil at 4k i believe

#

sw raster can also be great to allow for larger view distances

#

or later lod switching

wicked notch
#

Yeah, also overdraw I guess is much better with SW for larger draw distances

wicked notch
#

sadness

#

Why does Windows freak out when I go over the physical memory limit, it's using far more than 48GB bleakekw

glass sphinx
#

blender is such a slow boy sometimes

wicked notch
#

I just re-read unreal's slides for the, uh

#

7th time KEKW

#

And they say "we software rasterize all clusters where at least one triangle is more than 32 pixels wide"

frank sail
#

huh

cedar seal
#

At least?

wicked notch
#

At most*

#

No I mean

cedar seal
#

More?

wicked notch
#

I can't fucking write bleakekw

cedar seal
#

No more?

wicked notch
#

If all triangles are less than 32 pixels wide they software rasterize
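
#

As a CPU-side C++ sketch, that rule comes down to something like the following. It assumes the triangles are already projected to screen space and measures "wide" as the larger side of each triangle's bounding box. The 32-pixel threshold is the number quoted from the slides above; the names (`classify`, `max_extent`, `ClusterClass`) are made up.

```cpp
#include <algorithm>
#include <array>
#include <vector>

struct Vec2 { float x, y; };           // hypothetical screen-space position
using Triangle = std::array<Vec2, 3>;

enum class ClusterClass { SoftwareRaster, HardwareRaster };

// Larger side of the triangle's screen-space bounding box, in pixels.
inline float max_extent(const Triangle& t) {
    const float w = std::max({t[0].x, t[1].x, t[2].x}) - std::min({t[0].x, t[1].x, t[2].x});
    const float h = std::max({t[0].y, t[1].y, t[2].y}) - std::min({t[0].y, t[1].y, t[2].y});
    return std::max(w, h);
}

// A cluster goes to the software path only if every triangle is small;
// a single large triangle sends the whole cluster to the hardware path.
inline ClusterClass classify(const std::vector<Triangle>& cluster,
                             float threshold_px = 32.0f) {
    for (const Triangle& t : cluster) {
        if (max_extent(t) >= threshold_px) return ClusterClass::HardwareRaster;
    }
    return ClusterClass::SoftwareRaster;
}
```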

#

That makes sense

cedar seal
#

That would make sense yes

#

And hw rasterization rejects exactly same triangles?

wicked notch
#

Now, does it make sense to have a compute shader with invocation size meshlet_count and workgroup size (MAX_PRIMITIVES, 1, 1) that can check for that

#

It should also cull now that I think about it

wicked notch
cedar seal
#

Workgroup sizing has always been sort of a mystery to me

#

Only good way I know is to use microbenchmark

wicked notch
#

I've googled without results, where do I find that tool

cedar seal
#

I mean you write your own microbenchmarks that tell performance of different group sizes.

#

Brute force search essentially

#

Results can be and often are gpu specific, so the benchmarks need to be run on each installation.
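
#

The brute force search itself is tiny. A C++ sketch, assuming you supply a `measure_ms` callback that dispatches the shader with a given workgroup size and returns the GPU time in milliseconds (for example from timestamp queries); `pick_fastest` and `measure_ms` are hypothetical names.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

// Measures every candidate workgroup size a few times and returns the one
// with the lowest best-case time. Returns 0 if there are no candidates.
inline unsigned pick_fastest(const std::vector<unsigned>& candidates,
                             const std::function<double(unsigned)>& measure_ms,
                             int repeats = 5) {
    unsigned best_size = 0;
    double best_time = std::numeric_limits<double>::infinity();
    for (unsigned size : candidates) {
        // Take the minimum over several runs to filter out noise.
        double fastest = std::numeric_limits<double>::infinity();
        for (int i = 0; i < repeats; ++i) {
            fastest = std::min(fastest, measure_ms(size));
        }
        if (fastest < best_time) {
            best_time = fastest;
            best_size = size;
        }
    }
    return best_size;
}
```

Since the answer is GPU specific, the result would be cached per device rather than baked into the shader.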

wicked notch
#
shared meshlet_glsl_t s_meshlets[MESHLETS_PER_WORKGROUP];
shared mat4 s_pvm[MESHLETS_PER_WORKGROUP];
shared vec3 s_vertices[MAX_VERTICES * MESHLETS_PER_WORKGROUP];
shared uint s_primitive_size[MESHLETS_PER_WORKGROUP];``` Do you guys think this is too much shared data? bleakekw
raven orchid
#

How big are the constants

wicked notch
#

MESHLETS_PER_WORKGROUP is 4, MAX_VERTICES = MAX_PRIMITIVES = 64
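
#

Back-of-the-envelope total for those four arrays, as a C++ sketch. Two of the sizes are assumptions: `meshlet_glsl_t` is taken to be 64 bytes (the "something like 64 bytes" mentioned earlier) and each shared `vec3` is taken to be padded to 16 bytes, which is the pessimistic case.

```cpp
#include <cstddef>

constexpr std::size_t kMeshletsPerWorkgroup = 4;  // from the message above
constexpr std::size_t kMaxVertices = 64;          // from the message above
constexpr std::size_t kMeshletBytes = 64;         // ASSUMPTION: sizeof(meshlet_glsl_t)
constexpr std::size_t kMat4Bytes = 16 * 4;        // 16 floats of 4 bytes each
constexpr std::size_t kVec3Bytes = 16;            // ASSUMPTION: vec3 padded to 16 bytes
constexpr std::size_t kUintBytes = 4;

// Sum of s_meshlets, s_pvm, s_vertices and s_primitive_size.
constexpr std::size_t shared_bytes() {
    return kMeshletsPerWorkgroup * kMeshletBytes                // s_meshlets
         + kMeshletsPerWorkgroup * kMat4Bytes                   // s_pvm
         + kMaxVertices * kMeshletsPerWorkgroup * kVec3Bytes    // s_vertices
         + kMeshletsPerWorkgroup * kUintBytes;                  // s_primitive_size
}

static_assert(shared_bytes() == 4624, "256 + 256 + 4096 + 16");
```

Under those assumptions that is about 4.5 KiB, well below the 16384 bytes Vulkan guarantees for `maxComputeSharedMemorySize`, though it can still limit how many workgroups fit on an SM at once.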

wicked notch
#

Eh I expected nothing and of course this classificator is bad bleakekw

#

With this new classificator I have perfect accuracy

#

Except it takes about the same time it takes for me to rasterize the entire model bleakekw

#

How the hell does Unreal do this

#

damnit

glass sphinx
#

this is probably an area that needs a ton of testing on big maps used in production

wicked notch
#

occupancy is great though KEKW

#

note: the "cull" part is a LIE, there is no culling right now

glass sphinx
#

MASSIVE

wispy spear
#

so what is the actual problem right now?

#

you can only render 10mio tris compared to UE's 100mio?

wicked notch
#

Only one problem

#

I can't efficiently classify clusters

wispy spear
#

compared to what

wicked notch
#

To right now

wispy spear
#

but in what does it manifest itself

wicked notch
#

1.5ms to classify clusters is terrible

wispy spear
#

ah

#

how long does UE take?

wicked notch
#

I have no idea, but considering the whole nanite pass takes less than 2 milliseconds..

wispy spear
#

its broken down into clusters already neh?

wicked notch
#

Yes, the classify shader is as efficient as I could make it

wispy spear
#

ah

#

ship it

#

provide debug/release binaries

#

let other frogs try it out

wicked notch
wispy spear
#

perhaps driver/hardware combos fuck with the results

wicked notch
#

This is the shader if anyone wants to take a look

wicked notch
frank sail
#

have you profiled UE with nsight yet

wispy spear
#

or power states

wicked notch
#

Opening UE5 at the speed of light

frank sail
wispy spear
#

ram really got cheap af btw

#

i was wondering if i should go from 64 to 128 too 🙂

raven orchid
wicked notch
#

yes

wispy spear
#

uqawimogom 🙂

#

explains a lot heh

raven orchid
#

I wonder if the atomic ops are eating most of its time

frank sail
#

inb4 bank conflicts

#

not sure where that'd appear in the profiler. maybe under VRAM

wicked notch
#

I'd say vertex fetch and transform are eating up my precious milliseconds

#

Unfortunately I have no idea what LGSB is KEKW

frank sail
#

is there something you can do to simulate fetching fewer vertices/less vertex data to see if that helps perf

wicked notch
#

Yes, fewer verts does help

frank sail
#

what about atomics

wicked notch
#

64 verts / 64 prims meshlets are the best

#

64 / 126 recommended by nvidia is kinda trash

frank sail
#

try replacing atomics with regular ops and see if perf changes (ignoring the brokenness)

wispy spear
#

LGSB = LarGe String Buffer

wicked notch
#

removing the atomics quite literally changed nothing bleakekw

#

Damn alright

#

I guess AMD needs to invent infinity cache except for memory bandwidth

#

Infinite Memory Bandwidth (AMD patent pending)

frank sail
#

How compact are your vertices

#

Maybe you can shave some bytes off

wicked notch
#

I can

#

I have purposely left quantization for later™️

#

But I guess it's my only shot at better performance

wispy spear
#

out of curiosity i tried to open various nsight docs and tried putting LGSB into their searchboxes, 0 hits

#

which is quite weird

wicked notch
#

C+F: Long Scoreboard

#

It's basically "wait for this memory load to complete"

wispy spear
#

ah

#

appendix wouldvebeenve nice

wicked notch
#

It's a pain in the arse to search for docs for NV

#

Yes

wispy spear
#

can you press F1 in that LGSB column/row within nsight

#

or is there a ? button in the system menu like good old win9x windows had

wicked notch
#

I hovered over LGSB and it said "Long Scoreboard" yes

wispy spear
#

ah lol

#

showing myself out

wicked notch
#

Nah I didn't notice myself at first bleakekw

#

Docs should be easily available

wicked notch
#

when the classification stage takes more time than the rasterizer

#

I love absolutely nonsensical results

wispy spear
#

did you talk to devsh yet?

wicked notch
#

He's not active recently so I couldn't catch him

wispy spear
#

dont be afraid to boop him, im sure he has a few bits and bops to say about this

wicked notch
#

By the way something is 100% amiss with meshlets

#

There is no way in hell these 3 meshlets (remember, they have 64 triangles inside them) have a triangle MORE than 32 pixels wide

#

Red is hardware, Blue is software btw

#

Unless they are part of the same meshlet, which again makes no sense because meshlets should be contiguous

frank sail
#

inb4 bad perf is due to a bug causing you to render 64x more stuff

wicked notch
#

imma open an issue with meshoptimizer

#

Nevermind

#

zeux doesn't agree with me

#

mfw
I’m not sure I agree with this being a mistake. For mesh shading implementation on desktop my assessment has been that under filling meshlets results in more efficiency loss than the extra culling is worth.

#

At least I have peace of mind anything I've done so far wasn't wrong

wicked notch
#

Aight I guess I'll make my own custom meshopt based on the actual meshopt

#

It's funny, anything made by zeux I end up making my own fork
gltfpack => I have my own
meshopt => soon™️

#

I guess our opinions are very different bleakekw

frank sail
#

lvstri trying to render 1 (one) mesh

wicked notch
#

Anything I do does not conform to any generally agreed upon standard, I have different requirements for everything bleakekw

#

Feels like I'm reinventing the universe

#

I'll follow deccus suggestion and take a break

#

Jaker you now have an employee

#

I expect paychecks

frank sail
#

you could also implement a different part of the renderer if you don't want to break too hard

#

like shading

#

but yeah don't burn yourself doing too much schtuff

wicked notch
#

perhaps I'll port what I had back in opengl yes

#

Shadow maps took 4ms to render with culling back then

#

I wonder how much that improves with this ridiculously optimized pipeline

frank sail
#

5ms bleakekw

#

will be interesting to see though froge

wispy spear
#

and speak to devsh

#

since wpotrick is rather useless 😄

#

and or write about your findings and problems

#

publish it somehow

wicked notch
#

potrick has helped me a lot with instancing, he's good

wispy spear
#

(i know, i was just kidding)

wicked notch
#

Also I think he just wants me to join his cult bleakekw

distant lodge
#

devsh has done full on meshlet stuff? I know he mentioned doing a visbuffer

wispy spear
#

yeah the so called brainworm

wicked notch
#

epic, I'll pong him tomorrow I guess

wispy spear
#

doesnt hurt