#Iris - A Journey through OpenGL and beyond to learn Graphics

glass sphinx
#

huh the fuck

#

for me per primitive attributes work fine

buoyant summit
#

that doesn't work like that in all companies 💀

wicked notch
buoyant summit
#

if slang was intel and intel had a driver bug they'd probably add a workaround to the compiler

wicked notch
#

so gl_PrimitiveID and gl_CullPrimitiveEXT are not marked

glass sphinx
#

what is primitive id

wicked notch
#

gl_MeshPrimitivesEXT[id].gl_PrimitiveID

#

I use this to output primitive id to visbuffer

glass sphinx
#

what does it do over an index

#

i always just passed a uint

wicked notch
#

it's probably the same thing, one is per vertex and one is per primitive

#

idk

#

gl_CullPrimitiveEXT tho bleakekw

glass sphinx
#

its just an input for some shader stages

glass sphinx
#

in geo shaders its an output

#

no the prim id

wicked notch
#

oh you mean gl_PrimitiveID

glass sphinx
#

yee

wicked notch
#

I mean

#

```glsl
perprimitiveEXT out gl_MeshPerPrimitiveEXT {
  int  gl_PrimitiveID;
  int  gl_Layer;
  int  gl_ViewportIndex;
  bool gl_CullPrimitiveEXT;
  int  gl_PrimitiveShadingRateEXT;
} gl_MeshPrimitivesEXT[];
```
vk spec says this is writeonly

glass sphinx
#

hmm

#

cool that its documented in zhe mesh shader ext tho

buoyant summit
#

tbf I also don't really get the point of mesh shader gl_PrimitiveID beside some weird compat with frag shaders that were consuming/expecting gl_PrimitiveID

glass sphinx
#

yea

buoyant summit
#

btw I found a use case for secondary command buffers

#

you can defer decisions like loadOp, storeOp and image layouts until later point in time

glass sphinx
#

but cant you do that with normal cmd buffers too

buoyant summit
#

no

#

suspend/resume requires that loadOp etc match

#

and image layouts

glass sphinx
#

suspend and resume rendering needs the

#

uuuh

#

ok

#

interesting

buoyant summit
#

tbh my abstraction is just kinda weird

#

but secondary cmdbufs turned out to be rather handy for it

glass sphinx
#

nice

#

not weird if it works

delicate rain
#

I never understood the secondary command buffers

#

I thought they were what work graphs are meant to be

buoyant summit
#

do you mean indirect commands

#

secondary command buffers are like normal command buffers except they need less information before you can record a draw

delicate rain
#

Ah okay I was heavily misinformed then

glass sphinx
#

i dont like the name command buffer

#

buffer too overloaded

#

command list better

delicate rain
#

I thought secondary command buffers could be dispatched from primary ones

buoyant summit
#

🥱

#

yes

delicate rain
#

Hmmmm

wicked notch
buoyant summit
#

me on my way to rename descriptor_buffer to descriptor_list

glass sphinx
#

NOW

wicked notch
#

I have university account

glass sphinx
#

ok good

buoyant summit
#

tbh one thing I'd like if memory management for cmdbufs was more explicit

#

CommandPool feels ugh

glass sphinx
#

true

#

bikeshed tho

buoyant summit
#

can I just please provide a callback that driver will call every now and then froge_sad

buoyant summit
#

btw

#

VK_EXT_nested_command_buffer is a thing apparently

#

so you can recurse with command buffers very deeply

#

idk who asked for that, but

glass sphinx
#

lool

wispy spear
#

John Daxa probably

delicate rain
#

He actually is called "Join Daxa"

buoyant summit
#

Jane Daxa

wispy spear
#

my bad i meant Jérôme Daxa

wicked notch
#

it's been sitting there for 2 years

#

thank you John Khronos bleakekw

loud crag
#

in the default implementation yeah but I have a custom one which dynamically creates a set with all required images

#

hmm no that works

#

all my textures are SRGB generally

frank sail
pale horizon
#

Also you can open ImGui's style editor in DemoWindow and color pick the colors to see if they match
Simple sRGB-cope implementations usually have subtly wrong colors

#

(especially FrameBg, FrameBgHovered etc.)

loud crag
primal shadow
#

@wicked notch have you started software raster? Do you know why nanite forces HW raster for triangles that need to be clipped to the screen?

wicked notch
#

clipping is faster in HW

primal shadow
#

Isn't it just a min/max of each screen space vertex to clamp to the screen bounds?

wicked notch
#

but tbh I didn't think about it too hard, I just accepted what Brian said KEKW

wheat haven
delicate rain
#

yeah exactly

primal shadow
#

Ah I see

#

Oh wait I misunderstood what this tutorial I'm reading is doing

#

They were clipping the aabb for the triangle, not the triangle vertices

delicate rain
primal shadow
#

Thanks!

wispy spear
delicate rain
#

it is a lecture slides thingy

wispy spear
#

still

#

same

delicate rain
#

hmm I don't see the issue here. It is a lecture 10 - which is about GPU architectures (architektury in Czech) part II

frank sail
primal shadow
#

Does anyone understand what nanite talks about when it comes to solving for the x interval for software raster?

#

I suppose I should read the PDF they link as reference [71] 😛

primal shadow
#

Nope I read through cure's GitHub project and can't actually find what they're doing

primal shadow
#

@wicked notch how do you decide between SW/HW raster per cluster? Nanite presentation says all 3 triangle edges < 32 pixels long. But uhh, how do you determine that at the cluster level? Compute some kind of average triangle edge length per cluster?

wicked notch
#

I personally just do aabb to viewport space

#

it's good enough for me

primal shadow
#

And then what, SW rasterize anything with a small viewport area?

wicked notch
#

yes

primal shadow
#

Makes sense, thanks
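
The per-cluster choice described above (project the cluster AABB to viewport space, then software-rasterize anything small) might look roughly like the sketch below. `ScreenAabb`, `UseSoftwareRaster` and the `maxAreaPixels` threshold are hypothetical names invented for illustration; the chat only says "small viewport area" and gives no concrete cutoff.

```cpp
#include <cassert>

// Hypothetical screen-space bounds of one cluster, in pixels.
struct ScreenAabb {
    float minX, minY, maxX, maxY;
};

// Returns true when the projected cluster bounds cover a small enough area
// that software raster is expected to win. maxAreaPixels is an assumed
// tuning knob, not a value taken from the discussion.
inline bool UseSoftwareRaster(const ScreenAabb& box, float maxAreaPixels) {
    assert(box.maxX >= box.minX && box.maxY >= box.minY);
    const float width = box.maxX - box.minX;
    const float height = box.maxY - box.minY;
    return width * height <= maxAreaPixels;
}
```

For example, a threshold of 1024 pixels would loosely echo the 32-pixel edge figure quoted from the Nanite presentation, but that mapping is a guess and would need tuning per GPU.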

wicked notch
primal shadow
#

I'm almost ready to do software raster, I just need some final wgpu changes to do image atomics. Going to use buffers in the meantime.

#

I also need to figure out how to rework my culling code to feed into the SW/HW raster. Probably switch to two atomic lists of cluster IDs.

#

And then indirect dispatched on the SW raster / hw index buffer write.

wispy spear
#

isnt that what potti said?

#

the frogs over there fiks shit rather kwik

glass sphinx
#

yes

wispy spear
#

thats how we will get LVSTRINV soon

frank sail
#

LPSTRI

wispy spear
#

that reminds me of pastries

wispy spear
#

luigi? i think you should try to get into novideo

#

after school

#

just like pixel went to valve

#

having a frog at important spots in the industry helps to fight John Khronos

wicked notch
#

you're asking me to go to the dark side

wispy spear
#

explicitly

buoyant summit
#

why is nv dark side

wicked notch
#

nv bad amd good

#

isn't that how it works

buoyant summit
#

no

#

everyone bäd

wicked notch
#

true that

frank sail
#

my megacorp is better than your megacorp

delicate rain
#

NV money

#

Go nv

#

I love monopoly

#

Long live nv

delicate rain
#

@wicked notch I move here to not pollute Vulkan

#

I don't get it

#

do you do one dispatch per task shader workgroup?

#

or per subgroup?

#

(my actual knowledge of mesh shaders is limited)

wicked notch
#

workgroup is 128
subgroup is 32
when I call vkCmdDrawMeshTasksEXT(1, 1, 1) I dispatch one task shader to cull 128 meshlets

#

if all of those survive I will have 4 mesh shader commands (EmitMeshTasksEXT()) each with a workgroup dispatch size of 32

#

what I'm trying to do is reduce the number of EmitMeshTasksEXT calls so that NV command processor doesn't commit sudoku

delicate rain
#

okay so you still cull 32 meshlets per task subgroup?

#

uh

#

I'll just tell you what we do in Tido

wicked notch
#

yes

#

32 per subgroup, 128 per workgroup

delicate rain
#

each task shader invoc culls 32 meshlets, we get base offset, subgroup offset (I think) and survivor bitmask

#

then we dispatch mesh shader (32 threads) for each surviving meshlet

loud crag
#

yeah survivor bitmask sounds like a good idea

wicked notch
#

what's the survivor bitmask for

delicate rain
#

to know what meshlet mesh shader works on

wicked notch
#

you already know that don't you

loud crag
#

replaces your deltaIDs from your payload

delicate rain
#

nth mesh dispatch looks for nth set bit in the bitmask

wicked notch
#

ah

#

I see

delicate rain
#

your payload shrinks 32 times

#

(or 8 if you use weirdo uints)
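
A CPU-side sketch of the survivor-bitmask payload, under the assumption that one task subgroup handles 32 consecutive meshlets: instead of storing up to 32 meshlet indices, the payload keeps one base index and one 32-bit mask. `TaskPayload`, `BuildPayload` and `SurvivorCount` are illustrative names, not Tido's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical payload: one base index plus one bit per meshlet.
struct TaskPayload {
    std::uint32_t baseMeshlet;   // index of the first of the 32 meshlets
    std::uint32_t survivorMask;  // bit i set => meshlet baseMeshlet + i survived culling
};

// Packs 32 per-meshlet cull results into the mask.
inline TaskPayload BuildPayload(std::uint32_t baseMeshlet,
                                const std::array<bool, 32>& survived) {
    TaskPayload payload{baseMeshlet, 0u};
    for (std::uint32_t i = 0; i < 32; ++i) {
        if (survived[i]) {
            payload.survivorMask |= (1u << i);
        }
    }
    return payload;
}

// Number of set bits, i.e. how many mesh workgroups need to be emitted.
inline std::uint32_t SurvivorCount(const TaskPayload& payload) {
    std::uint32_t count = 0;
    for (std::uint32_t m = payload.survivorMask; m != 0; m &= (m - 1)) {
        ++count;  // each iteration clears the lowest set bit
    }
    return count;
}
```

On the GPU the mask would come from a subgroup ballot and the count from a bit-count intrinsic; the loops here only mirror that logic.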

wicked notch
#

so you do an exclusive workgroup bitcount?

#

if there even is an intrinsic for that

#

maybe it's only subgroup bitcount

delicate rain
#

we cooked with Patrick

#

wait

#
```
func wave32_find_nth_set_bit(uint mask, uint bit) -> uint
{
    // Each thread tests a bit in the mask.
    // The nth bit is the nth thread.
    let wave_lane_bit_mask = 1u << WaveGetLaneIndex();
    let is_nth_bit_set = ((mask & wave_lane_bit_mask) != 0) ? 1u : 0u;
    // Inclusive prefix sum: set bits up to and including this lane.
    let set_bits_prefix_sum = WavePrefixSum(is_nth_bit_set) + is_nth_bit_set;

    // True for the lane holding the nth set bit and for unset lanes right after it.
    let does_nth_bit_match_group = set_bits_prefix_sum == (bit + 1);
    // Ballot of all lanes whose inclusive count equals bit + 1.
    uint4 ballot = WaveActiveBallot(does_nth_bit_match_group);
    // The lowest matching lane is the nth set bit; 100 means no such bit.
    uint first_set_bit = WaveActiveMin((ballot.x & wave_lane_bit_mask) != 0 ? WaveGetLaneIndex() : 100);
    return first_set_bit;
}
```
#

returns the position of nth set bit in a bitmask, uniform for entire subgroup
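
For checking the wave version against something simple, a scalar reference could look like this. It is only a sketch: `FindNthSetBit` is a made-up name, `n` is zero-based like `bit` in the shader, and it returns 32 for "not found" where the shader uses 100.

```cpp
#include <cassert>
#include <cstdint>

// Returns the position of the n-th set bit of mask (n is zero-based),
// or 32 when mask has n or fewer set bits.
inline std::uint32_t FindNthSetBit(std::uint32_t mask, std::uint32_t n) {
    assert(n < 32);
    std::uint32_t seen = 0;  // set bits found below the current position
    for (std::uint32_t lane = 0; lane < 32; ++lane) {
        if ((mask >> lane) & 1u) {
            if (seen == n) {
                return lane;
            }
            ++seen;
        }
    }
    return 32;
}
```

The n-th mesh workgroup would then process meshlet `base + FindNthSetBit(survivorMask, n)`.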

wicked notch
#

why would you do this instead of just using subgroupBallotExclusiveBitCount bleakekw

delicate rain
#

I think it's not in slang

wicked notch
#

ah so slang cope

#

epic

loud crag
delicate rain
#

how do you know what thread to broadcast from

#

like the thread itself knows sure

#

but how do the others know

wicked notch
#

interesting question

#

intuitively, each subgroup invocation calculates its own exclusive bit count

#

so instead of a ballot exclusive bit count, you do a regular bit count on the mask

delicate rain
#

well you want a subgroup per meshlet

#

so all 32 threads should agree on the meshlet

#

no?

#

or do you have a different mapping

wicked notch
#

I don't, you are indeed correct

#

4am brain syndrome

delicate rain
#

the exclusive/inclusive bit count would just replace the WavePrefixSum and two lines above

wicked notch
#

so it's a workgroup bit count

delicate rain
#

the ballot and min you still need

wicked notch
#

yeah it's basically what you and patrick cooked

#

your code is now mine

delicate rain
#

yes yes

#

I will need soft shadows soon, so let's call it an exchange 🤝

glass sphinx
#

nv recommends 32 still

#

i also always lost perf on larger groups

#

the survivor list reduced the lsb pressure quite a bit that was very nice

#

larger groups are probably better when you have larger payload

#

i see now why you did it

wicked notch
#

the idea was to reduce the number of invocations needed to cull and draw

buoyant summit
wicked notch
#

can I at least expect gl_SubgroupID to work as I would expect it to work?

#

when workgroup size is multiple of subgroup size

glass sphinx
#

nice

loud crag
buoyant summit
#

slang your beloved (NOT MINE THOUGH)

wicked notch
#

holy

#

now we only need VMM and we're golden

glass sphinx
#

they just implement it

#

i love them

glass sphinx
#

^ you when a slang release drops (at least this one)

wicked notch
#

@glass sphinx do you have docs on spirv intrinsics for slang

buoyant summit
#

trying to review a slang patch and it makes my brain boil slightly

glass sphinx
buoyant summit
#

where slang discord/irc..

#

-emit-ir, handy

buoyant summit
#

I think and proof read a lot before posting

wispy spear
#

hehe

#

better than not proof reading at all ❤️

frank sail
#

I wonder how nano copes with email

buoyant summit
frank sail
#

Same

buoyant summit
#

but also by not sending any emails often

frank sail
#

I hate email

wispy spear
#

yeah it sucks

buoyant summit
#

agree but I like receiving notifications on my issues/MRs/whatever in a centralized box

#

very handy

frank sail
#

Sure but I mean for direct communication

buoyant summit
#

yeah for direct communication it kinda sucks

frank sail
#

Especially when mfs splinter threads and then the inbox becomes impossible to navigate bleakekw (might be a client/skill issue idk)

buoyant summit
#

yes, skill issue

wispy spear
#

plus all the signature junk polluting every thread

frank sail
#

what client is this

buoyant summit
#

thunderbird

frank sail
#

in Outlook it's just a list in chronological order, at least by default

#

ass

buoyant summit
#

never used outlook

frank sail
#

I use it for work unfortunately

#

I did use Thunderbird once. Perhaps I should use it again

buoyant summit
#

considering outlook is supposed to be backed by a big chungus corporation it probably has comparable functionality

#

so mayhaps just do a bit of googling

frank sail
#

Idk what to search for I guess

#

I'll figure it out

buoyant summit
#

do you use folders

#

or do you just look at your inbox

#

create folder + setup a filter by something

frank sail
#

I have a bunch of filters and folders already

#

I just need to make it look nice

delicate rain
#

Damn email powerusers

frank sail
#

It's necessary when you get dozens of emails every day bleakekw

buoyant summit
delicate rain
#

I contribute to one repo and they have the GitHub actions setup to try and build after each commit for like 15 different setups. So after each commit you receive 15 emails of the build succeeding or failing

frank sail
#

hmm I only get emails when CI fails in my thing (Fwog)

delicate rain
#

Or maybe it's that, but their gh actions are broken so everything fails 🥸

#

I don't really read them, because it was just garbage every time I did

#

900k+ lines of compiler errors

finite quartz
glass sphinx
#

if end to end encription gets outruled at some point im safe

#

i trained all the ppl i know to decypher my shit

#

noone else will understand

primal shadow
#

My depth down sampling is missing some pixels on npot textures

#

And therefore the way I project bounding spheres to screen space is wrong. Ahhh.
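
One common way to keep a depth reduction conservative on NPOT sizes is to read a third row or column whenever the source dimension is odd, so the texels at the bottom and right edges are never dropped. The sketch below is a CPU illustration of that idea, not anyone's actual shader; it keeps the farthest depth with `max`, which assumes larger values are farther away (swap to `min` for reversed-Z). `DepthImage` and `DownsampleFarthest` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct DepthImage {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height entries
};

// Produces the next mip level (half size, rounded down, at least 1x1).
inline DepthImage DownsampleFarthest(const DepthImage& src) {
    assert(src.width > 0 && src.height > 0);
    assert(src.texels.size() == static_cast<std::size_t>(src.width) * src.height);

    DepthImage dst{std::max(1, src.width / 2), std::max(1, src.height / 2), {}};
    dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);

    // Odd source dimensions need a 3-texel footprint to cover every source texel.
    const int tapsX = (src.width & 1) ? 3 : 2;
    const int tapsY = (src.height & 1) ? 3 : 2;

    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            float farthest = std::numeric_limits<float>::lowest();
            for (int dy = 0; dy < tapsY; ++dy) {
                for (int dx = 0; dx < tapsX; ++dx) {
                    // Clamp so 1-texel-wide sources stay in bounds.
                    const int sx = std::min(2 * x + dx, src.width - 1);
                    const int sy = std::min(2 * y + dy, src.height - 1);
                    farthest = std::max(
                        farthest,
                        src.texels[static_cast<std::size_t>(sy) * src.width + sx]);
                }
            }
            dst.texels[static_cast<std::size_t>(y) * dst.width + x] = farthest;
        }
    }
    return dst;
}
```

With a 5-wide source, for instance, output column 0 reads source columns 0 to 2 and output column 1 reads columns 2 to 4, so column 4 is no longer skipped.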

wicked notch
#

it's what one does to avoid pagefaulting bleakekw

#

though tbh I don't think adding a page request is too bad

delicate rain
#

hmm with PCF this is easy, you just request the PCF region to be in the same clip, but I have no clue about how to do it with pcss

wicked notch
#

same thing except it's max(blockerSearchRadius, maxFilterRadius)

delicate rain
#

right, but those can be massive no?

wicked notch
#

oh

#

yes bleakekw

delicate rain
#

yeah...

#

cope it is

wicked notch
#

I have to try pagefaulting

#

it's really just an atomicAdd, how bad can it be

delicate rain
#

I don't follow

wicked notch
#

when filtering, if a page is not allocated you just emplace an allocation request in the list

delicate rain
#

to be drawn next frame?

wicked notch
#

ye

wicked notch
#

@primal shadow you're building your own version of SPD right?

#

can you pin me the discussions you had with devsh, I can't find them

primal shadow
wicked notch
#

rip

primal shadow
#

Depth down sampling for the initial pass is hard though. I messed up and need to fix it.

wicked notch
#

halving time for hzb copy + reduce, 'ery nice

#

150us -> 80us

#

now I can spam this for all 16 clipmaps

glass sphinx
#

that will be 1.3 ms

wicked notch
#

yes bleakekw

glass sphinx
#

why dont you do single pass downsampling

wicked notch
#

because amd cannot comprehend the existence of operating systems other than windows

#

this is kinda sorta single pass

#

except it's 3 pass KEKW

glass sphinx
#

oooff

#

wait but you can just copy the tido single pass downsampler

#

it just doesnt properly work on the bottom and right edge if you have a non-POT res

wicked notch
#

does it work for stuff that is non square

#

like 2048x1024

glass sphinx
#

yea

wicked notch
#

nice

wicked notch
#

assume it's just single channel float

glass sphinx
#

i was thinking of rescaling the lowest mip to some pot size

glass sphinx
#

on 4080

#

fully memory bound

wicked notch
#

hmm

glass sphinx
#

wait no its less

#

at 1440p it's 20 microsecs

wicked notch
#

for me it takes 30 microseconds to copy the image and 30 microseconds to fully downsample it

glass sphinx
#

tidos should be 11.5 microseconds for 2048*1024

#

on 4080

wicked notch
#

does that include the initial reduction

glass sphinx
#

no

wicked notch
#

ok so your thing is 2.9x faster

glass sphinx
#

arguably i should remove the writeout for the lowest 2-3 mips not just mip0

#

culling on that fine scale is unnecessary

#

also slower even potentially

#

i need to test it again

delicate rain
wicked notch
#

btw saky

#

hear me out

#

when drawing the vsms

#

I bind temporary virtual resolution depth texture to get the sweet early Z

primal shadow
#

@glass sphinx do you have code to share? I realized my down sampling does not work for npot textures :(.

wicked notch
#

now let's say I build HZB out of this temporary depth texture

#

in the initialization step, I query for the active page table, and write 0 if page is inactive, or the depth if it is active

wicked notch
#

then I reduce the thing and cull

#

culling outputs a bitmask of visible meshlets

#

this repeats for each clipmap

#

downside is subchannel switch

delicate rain
#

uhh

primal shadow
delicate rain
#

I don't fully follow

#

also it makes it so that you have to barrier between each clip draw

#

16 barriers each frame

#

yucky

wicked notch
#

no I don't need barriers

#

or yes I do

#

but only for the depth texture

#

transitioning from undefined because I discard the results

primal shadow
# wicked notch culling outputs a bitmask of visible meshlets

I switched to 2 bits per meshlet btw. 0 = not eligible (wrong lod, instance culled, failed frustum cull), 1 = first pass draw (passed last frame occlusion cull), 2 = second pass candidate (failed last frame occlusion cull), 3 = second pass draw (passed current frame occlusion cull)
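
The four states listed above fit 16 to a 32-bit word. A minimal CPU-side sketch of that packing follows; the enum and function names are invented for illustration, and a GPU version would need atomic read-modify-write since neighbouring meshlets share a word.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// State values follow the numbering given in the message above.
enum class MeshletState : std::uint32_t {
    NotEligible = 0,
    FirstPassDraw = 1,
    SecondPassCandidate = 2,
    SecondPassDraw = 3,
};

// Overwrites the two bits belonging to one meshlet.
inline void SetState(std::vector<std::uint32_t>& words, std::uint32_t meshlet, MeshletState s) {
    const std::uint32_t word = meshlet / 16;
    const std::uint32_t shift = (meshlet % 16) * 2;
    assert(word < words.size());
    words[word] = (words[word] & ~(3u << shift)) | (static_cast<std::uint32_t>(s) << shift);
}

// Reads the two bits belonging to one meshlet.
inline MeshletState GetState(const std::vector<std::uint32_t>& words, std::uint32_t meshlet) {
    const std::uint32_t word = meshlet / 16;
    const std::uint32_t shift = (meshlet % 16) * 2;
    assert(word < words.size());
    return static_cast<MeshletState>((words[word] >> shift) & 3u);
}
```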

delicate rain
wicked notch
#

yes

delicate rain
#

yeah

wicked notch
#

but it's accelerated so it doesn't count

delicate rain
#

yeah but the draws don't overlap

#

thats why I mainly don't like the temp Z buffer

wicked notch
#

they don't but I'm hoping for hiz culling to make up for the difference

primal shadow
#

Also I asked this question earlier and didn't get a response, do we really need to do an explicit frustum cull for each meshlet? We have to occlusion cull anyways, which involves projecting the culling sphere to a screen space aabb. We can just check if it's valid, no?

glass sphinx
delicate rain
#

like you draw the biggest clip, and cull the lower one against hiz from the biggest clip?

#

ie draw clip 8 - build hiz of 8 - cull clip 7 against 8 - draw clip 7 - build hiz 7 ....

wicked notch
#

no

#

it's like this

wicked notch
#
```
for clipmap in clipmaps {
  Clear(tempDepthBuffer);
  Draw(tempDepthBuffer, previousFrameVisibilityMask[clipmap.Index]);
  Copy(tempDepthBuffer, clipmap.HZB); // also performs an initial reduction, checks whether the current depth tile
                                      // is active in the page table, if it isn't active the depth written is going to be 0
  Reduce(clipmap.HZB);
  Cull(clipmap.HZB, currentFrameVisibilityMask[clipmap.Index]);
  Barrier();
}
```

delicate rain
#

right, but I don't see how this is different than doing it without temp hiz?

#

you can just as well draw previousFrameVisibilityMask into the VSM itself

#

reduce the VSM

#

and cull against that

#

you don't really need the temp z for this no?

#

or am I missing something again bleakekw

wicked notch
#

now that I think about it you're right I don't

#

I can just make a dispatch over the full virtual resolution where each thread checks the virtual page table, gets the physical texel and writes that (or 0 if the page isn't active)

#

this way draws overlap too

delicate rain
#

and we can do that inplace in the physical memory itself, (we just make the physical memory have mips, that way you can reduce over the physical memory)

wicked notch
#

ye but I don't understand how to cull against the actual VSM with mips bleakekw

delicate rain
#

I mean there will be some index math but I don't see how it would not be doable

#

conceptually it is all the same no?

wicked notch
#

I mean

#

yeah

#

but it feels too clamplicated compared to just having a good ole HZB

delicate rain
#

It will be a bit more involved yeah

wicked notch
#

thinking about it, a quad could span 4 different physical pages

delicate rain
#

yes

wicked notch
#

(one per corner)

delicate rain
#

so you need to go from NDC -> VSM_PAGE -> PHYSICAL_STORAGE

#

or, over some mip cutoff just NDC -> VSM_HIZ

#

it's just an indirection

#

(I'm saying it super nonchalantly, as if I won't run into 50 bugs trying to implement this)
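
The virtual page to physical storage indirection can be sketched on the CPU as below. Everything about the layout is an assumption made for illustration: the page table stores a physical page index per virtual page, `kUnbackedPage` marks a page with no backing, and physical pages sit row-major in an atlas. Only the 128-texel page size comes from the chat.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kPageSize = 128;              // texels per page side
constexpr std::uint32_t kUnbackedPage = 0xFFFFFFFFu;  // hypothetical "no physical page" marker

struct Texel {
    std::uint32_t x, y;
};

// Translates a virtual texel to a physical texel.
// Returns false (and leaves `out` untouched) when the page is not backed.
inline bool VirtualToPhysical(const std::vector<std::uint32_t>& pageTable,
                              std::uint32_t virtualPagesPerRow,
                              std::uint32_t physicalPagesPerRow,
                              Texel virtualTexel,
                              Texel& out) {
    const std::uint32_t vx = virtualTexel.x / kPageSize;
    const std::uint32_t vy = virtualTexel.y / kPageSize;
    const std::size_t index = static_cast<std::size_t>(vy) * virtualPagesPerRow + vx;
    assert(vx < virtualPagesPerRow && index < pageTable.size());

    const std::uint32_t physicalPage = pageTable[index];
    if (physicalPage == kUnbackedPage) {
        return false;
    }
    // Top-left texel of the physical page, plus the offset inside the page.
    out.x = (physicalPage % physicalPagesPerRow) * kPageSize + (virtualTexel.x & (kPageSize - 1));
    out.y = (physicalPage / physicalPagesPerRow) * kPageSize + (virtualTexel.y & (kPageSize - 1));
    return true;
}
```

A quad spanning a page boundary runs this lookup once per corner, which is why it can land in four different physical pages.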

wicked notch
#

too real

#

I'll try out the ez version

frank sail
# buoyant summit

btw I found the button to enable this view in outlook (show as conversations) froge_love

distant lodge
buoyant summit
#

@wicked notch make a blog

distant lodge
#

please make a blog, I don't wanna read twitter clones

buoyant summit
#

posting short messages can be easier for the author to write

hallow cedar
#

jokes on you i procrastinate doing either

distant lodge
#

post them into your text editor

#

and then upload it to a web host

buoyant summit
#

so it's a choice between lustri not writing anything (and you thus not reading) vs reading something that's perhaps not very quality

distant lodge
#

one weird trick

buoyant summit
#

leave it up to lustri to decide

distant lodge
#

true, I should probably give in and make a memestodon account so I can continue passively trawling for GP info

buoyant summit
#

and you VILL be reading microblogs on twitter clones

#

if the poster chooses

buoyant summit
#

or other instances will block the instance you've picked which might reduce the quality of your experience

buoyant summit
#

@dull oyster didnotreadfully but why do you do this:


```cpp
vk::DeviceAddress vertexBufferAddress = (vk::DeviceAddress)-1;
vk::DeviceAddress indexBufferAddress = (vk::DeviceAddress)-1;
```
#

why not just 0

#

0 is null & invalid
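
A small sketch of the zero-as-null suggestion. `DeviceAddress` stands in for `vk::DeviceAddress` (a 64-bit integer) so the example compiles without Vulkan headers, and `GeometryAddresses` is a hypothetical struct, not the one from the code being reviewed.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for vk::DeviceAddress so the sketch builds without Vulkan headers.
using DeviceAddress = std::uint64_t;

struct GeometryAddresses {
    DeviceAddress vertexBuffer = 0;  // 0 is treated as "not set"
    DeviceAddress indexBuffer = 0;
};

// Both buffers must have a real address before the geometry is usable.
inline bool IsComplete(const GeometryAddresses& g) {
    return g.vertexBuffer != 0 && g.indexBuffer != 0;
}

// Returns the vertex address, asserting that it has been set.
inline DeviceAddress VertexAddress(const GeometryAddresses& g) {
    assert(g.vertexBuffer != 0);
    return g.vertexBuffer;
}
```

Defaulting to 0 also means a zero-initialized struct starts in the "not set" state without any extra work.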

primal shadow
#

I use my own blog on GitHub pages. Took like an hour to setup.

dull oyster
wicked notch
#

I really should sleep because my eyes are failing me but the vsm HiZ experiment is almost done

wispy spear
#

allez dodo (off to bed)! 🦤

wicked notch
#
```glsl
const uint cornerX = index & log2(position);
const uint cornerY = index >> log2(position);
```
#

mmm yes

frank sail
#

are you doing two-pass hiz

wicked notch
#

just single pass rn

frank sail
#

because I think you either have to do that or make pages active for two frames so geometry can render properly

wicked notch
#

that does make sense

frank sail
#

I tried one-pass hi-z and didn't think of that issue, then it was broken and I deleted it all froge_bleak

wicked notch
#

ye I'm willing to accept some artifacts for now

#

I'll do two pass after this works properly and I get the perf boost I'm expecting

wicked notch
#

buckle up because I'm about to create a 2048x2048x16x4 image

wicked notch
#

ah yes

#

1ms spent on building the HZB bleakekw

delicate rain
#

Does it at least go brr when drawing?

wicked notch
#

I will test soon

#

meanwhile

#
```glsl
// Returns the depth stored for a virtual texel, or 0 when its page has no physical backing.
float SampleVirtualShadow(in uvec2 position) {
  const uint power = findMSB(VIRTUAL_SHADOW_PAGE_SIZE);
  const uvec2 virtualPagePosition = position >> power;
  const uint virtualPage = imageLoad(g_VirtualShadowVirtualPageTableImage, ivec3(virtualPagePosition, gl_WorkGroupID.z)).x;
  if (!VirtualShadowIsPageBacked(virtualPage)) {
    return 0.0;
  }

  const ivec2 physicalTexelCorner = VirtualShadowCalculatePhysicalTexelCorner(virtualPage);
  const ivec2 physicalTexel = physicalTexelCorner + ivec2(position & uvec2(VIRTUAL_SHADOW_PHYSICAL_PAGE_SIZE - 1));
  return uintBitsToFloat(imageLoad(g_VirtualShadowPhysicalMemoryImage, physicalTexel).x);
}

RetinaGroupSize(16, 16, 1)
void main() {
  // Each invocation reduces a 2x2 block of virtual texels into one output texel.
  const uvec2 position = gl_GlobalInvocationID.xy;
  const uvec2 virtualPosition = position << 1;
  const vec4 samples = vec4(
    SampleVirtualShadow(virtualPosition + uvec2(0, 0)),
    SampleVirtualShadow(virtualPosition + uvec2(0, 1)),
    SampleVirtualShadow(virtualPosition + uvec2(1, 0)),
    SampleVirtualShadow(virtualPosition + uvec2(1, 1))
  );
  const float v = min(min(samples.x, samples.y), min(samples.z, samples.w));
  imageStore(g_OutputImage, ivec3(position, gl_WorkGroupID.z), vec4(v));
}
```
can I make this better in any way
#

I stall on TEXTHR though so I doubt that bleakekw

delicate rain
#

On phone

#

Will check when I'm back on pc

#

2ms shadowmap draw seems fine no?

#

(from the capture )

wicked notch
#

that's without hzb, I will make culling go brr now

delicate rain
#

This is bistro 16 clips?

wicked notch
#

nono sponza

#

because it loads faster so I can test without waiting kekkedsadge

#

mfw no async load

delicate rain
#

Would be pog if it was bistro

wicked notch
#

stay tuned

#

soon™️

delicate rain
#

But fun when it works

#

Launch times are instant

#

Hmm, how much faster is Sponza compared to bistro?

wicked notch
#

a lot

#

but it's not comparable to regular raster either way

#

I think I can make the copy go faster if I abuse the fact that page size is 128

#

I just allocate a ton of shared memory and put the samples there

delicate rain
#

Okok, was just wonderin

wicked notch
#

I reduce memory traffic

#

memory subsystem is happy

#

I am happy

#

copy takes literally 3x less time

#

hell yeah

delicate rain
#

At the mere cost of 380 megabytes of VRAM we get hiz

wicked notch
#

less than what bistro would take in an acceleration structure KEKW

#

plus this is the dumb shitty naive method

#

if we reduce the VSM itself we get infinite power

delicate rain
#

The if is the important part bleakekw

wicked notch
#

I believe in you

delicate rain
#

I'm slowly burning out

#

Uni is killing me

wicked notch
#

ight moment of truth

delicate rain
#

Speed

wicked notch
#

I'm rendering at the blistering pace of VK_ERROR_DEVICE_LOST milliseconds per frame

delicate rain
#

I'm curious btw, are validation layers useful to you?

wicked notch
#

somewhat

delicate rain
#

Recently I can't really do anything other than shader print with them

wicked notch
#

I've started to ignore sync val

#

yes

#

god bless debugPrintf

delicate rain
#

Yeah, GPU based just crashes or misdiagnoses so the app doesn't run properly

#

Sync doesn't get bindless

#

And everything else is done by daxa

wicked notch
#

holy mother of all pog

#

it is working

#

ladies and gentlemen

#

800 microseconds raster on bistro

delicate rain
#

Letsgo

#

Damn that's fast

wicked notch
#

it takes longer to build the HZB bleakekw

delicate rain
#

If we combine it with hpb cull and caching it will be light speed

wicked notch
#

this is already hpb

delicate rain
#

Do you know if it actually works?

wicked notch
#

I write far depth into non backed pages

delicate rain
#

Like do you have Shadows?

delicate rain
wicked notch
#

oh yeah I forgot to do that KEKW

#

let me write basic sampling one sec

delicate rain
#

Crossing my fingers

wicked notch
#

btw I still get 14ms in this damned spot

#

at this point I have no idea why bleakekw

delicate rain
#

Uhhhhh

#

Where is that?

#

Under some of the bushes?

wicked notch
#

bushes

#

ye

delicate rain
#

I'll try if I get that too soon

#

Do you have alpha discard?

wicked notch
#

nop

delicate rain
#

Hmm weird

loud crag
delicate rain
#

Btw is it two pass?

wicked notch
#

one pass

delicate rain
#

So you just keep depth from previous frame and cull against that or?

wicked notch
#

no I use previous frame visibility mask

#

I do that because it's easier to switch to two pass this way

delicate rain
#

Previous frame visibility mask to do what?

#

(idk how single pass culling works) agonyfrog

wicked notch
#

previous frame visibility mask to cull current frame meshlets

#

it's a bogus approach

#

but it makes it easier to do two pass

#

Like I only have to do another culling pass and xor the visibility masks together

delicate rain
#

Hmm okay what is visibility mask? I thought it's just a bit mask of visible/notvisible for each cascade?

wicked notch
#

something to mark meshlets as visible/not visible

#

each bit encodes visibility of a single meshlet

delicate rain
#

How do you cull against that

#

Huuuh I don't understand

wicked notch
#
```glsl
bool IsMeshletVisible(uint meshletInstanceIndex) {
  const uint maskIndex = meshletInstanceIndex >> 6u;
  const uint bitIndex = meshletInstanceIndex & 0x3fu;
  const uint64_t mask = RetinaDereference(g_VisibleMeshletBuffer)[maskIndex];
  return (mask & (uint64_t(1) << bitIndex)) != 0;
}
```
#

this is in the task shaderino

wispy spear
#

3x fu 😛

delicate rain
#

Tido has super sampling froge

wicked notch
#

when I cull

#
```glsl
void main() {
  const uint meshletInstanceIndex = gl_GlobalInvocationID.x;
  if (meshletInstanceIndex >= u_MeshletCount) {
    return;
  }
  ...
  const uint meshletVisibilityMaskIndex = meshletInstanceIndex >> 6u;
  const uint meshletVisibilityMaskBit = meshletInstanceIndex & 0x3fu;
  if (IsMeshletVisible(aabb, viewInfo, transform)) {
    atomicOr(RetinaDereference(g_VisibleMeshletBuffer)[meshletVisibilityMaskIndex], uint64_t(1) << meshletVisibilityMaskBit);
  } else {
    atomicAnd(RetinaDereference(g_VisibleMeshletBuffer)[meshletVisibilityMaskIndex], ~(uint64_t(1) << meshletVisibilityMaskBit));
  }
}
```
#

let me take a moment to curse at shaderc

#
```
Failed to compile shader 'D:/Dev/CLion/Retina/src/Retina/Sandbox/Shaders/GBufferResolve.frag.glsl': shaderc: internal error: compilation succeeded but failed to optimize: ID '128[%g_VirtualShadowInfoBuffer]' defined in block '9[%9]' does not dominate its use in block '258[%258]'
  %258 = OpLabel
```
fuck you

wispy spear
#

i understand 0, but its utterly fascinating, and im not trying to be sarcastic or anything. its froge_love

#

maybe we lure the slang peeps onto our server and chain them somewhere in the basement, so that they can immediately fix those things

delicate rain
wicked notch
#

shadows do work with culling

#

and they're blazing fast to raster

#

VSM has been conquered

wicked notch
#

if meshlet passes both frustum and hiz test it is set as visible

delicate rain
#

okay so you draw meshlets visible last frame, build hiz, draw everything again and cull against hiz?

wicked notch
#

correct, minus the "draw everything again"

delicate rain
#

draw meshlets visible last frame, build hiz and...??

wicked notch
#

cull for next frame

delicate rain
#

when do you draw what was not visible last frame

wicked notch
#

don't try too hard to understand this method, it's garbage KEKW

delicate rain
#

why is it so hard for me to understand culling man

#

I stg I have huge skill issue

wicked notch
#

I'm just not doing that rn for shrimplicity

loud crag
delicate rain
wicked notch
#

ye that's when you xor the masks

delicate rain
#

for a frame I guess, because it is visible this frame so you draw it next frame

wicked notch
#

two pass occlusion culling works like this

#
```
Draw(visibleLastFrame);
BuildHZB();
CullAndXor();
Draw(disoccludedCurrentFrame);
```
#

I do this right now

```
Draw(visibleLastFrame);
BuildHZB();
Cull(); // for next frame
```

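
On the CPU, the mask step of the two-pass variant can be sketched as follows, with one bit per meshlet and 64 meshlets per word as in the shader snippets. The chat calls this step xor-ing the masks; the sketch uses and-not instead, which is an assumption about the intent: a plain xor would also select meshlets that were visible last frame but fail the current test, and those were already drawn in the first pass. `Disoccluded` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the meshlets that pass the current-frame cull but were not drawn
// in the first pass (bit set in visibleThisFrame, clear in visibleLastFrame).
inline std::vector<std::uint64_t> Disoccluded(const std::vector<std::uint64_t>& visibleLastFrame,
                                              const std::vector<std::uint64_t>& visibleThisFrame) {
    assert(visibleLastFrame.size() == visibleThisFrame.size());
    std::vector<std::uint64_t> result(visibleThisFrame.size());
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = visibleThisFrame[i] & ~visibleLastFrame[i];
    }
    return result;
}
```

The second draw then covers only the returned bits, and `visibleThisFrame` becomes `visibleLastFrame` for the next frame.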
delicate rain
#

right so you do have one frame of delay

wicked notch
#

yes

#

that's why what I'm doing is bad

delicate rain
#

okay now I get it

#

sorry for the confuslment

wicked notch
#

issa ok frogeheart

#

pure bliss

delicate rain
#

yeah this is awesome

wicked notch
#

1ms

delicate rain
#

what gpu btw?

wicked notch
#

3070

hallow cedar
#

inb4 4090

#

aw

#

in after

delicate rain
#

hmmm

#

I like this a lot

wicked notch
#

now potrick must figure out a way to make hzb build go brrr

#

because currently that takes more than VSM raster KEKW

delicate rain
#

btw do you use your heuristic or the original Jaker one?

#

for clip selection

wicked notch
#

I use mine because it's shrimpler

#

in practice it should be equiv

delicate rain
#

I keep running out of resolution on the 0th clip on a 2k monitor bleakekw

wicked notch
#

with my heuristic?

delicate rain
#

I need to turn my 0th clip world size to 8 meters to work

delicate rain
#

look at little sneaky guy

glass sphinx
#

eeeeviiiiill

wicked notch
#

potrick

#

make hzb build go brr

glass sphinx
#

livstris neuron is glowing

#

i did too much irl to program in free time last weeks

delicate rain
#

I wonder if better culling makes it faster to not cache at all

wispy spear
glass sphinx
#

@wicked notch we can prob just not write the lowest mips for hiz

#

maybe even only on page level

#

if we do it at page level it will be like 10 mics for all clips

#

idk how fucked the culling will be with that tho

wicked notch
#

ye I think the real strat is just doing reduction on the VSM itself

#

per page, that is

glass sphinx
#

tru

#

two pass

#

hmmhmm

wicked notch
#

I think it can work, because it's effectively the same thing, just the sampling gets more convoluted

glass sphinx
#

yes

#

smart

#

its getting complicated

astral field
#

forgefumbsup issues fix everything

wispy spear
astral field
wispy spear
#

: )

astral field
#

bleakekw oh, it wasn't french

wispy spear
#

i dont remember your nickname but you fly the team-effort-d3d11 role 🙂

#

or you are the 8bit guy

astral field
#

I'm 8 bit, yea

wispy spear
#

ah

astral field
#

OH you changed your name

wispy spear
#

you changed your whole account

astral field
#

nah, only username

#

8bit->nibble

#

half the bits

wispy spear
#

nibble is 4 bits

astral field
#

half the intelligence

wispy spear
#

fair

astral field
#

i had no idea you changed user, thought you disappeared

wispy spear
#

ill never disappear

astral field
#

gpu dev's never disappear, only manifest differently

wispy spear
#

im not a gpu dev though

astral field
wicked notch
#

I ponder

wicked notch
#

if the first clipmap is smol

#

frametime goes up a lot in that goddamned spot

#

near the bushes that is

#

it's fine literally everywhere else

#

I need to inspecc more

#

renderdoc really hates my 2k^2 x16 layers x12 mips image though kekkedsadge

#

so here is a partially reduced version of the first clipmap in that area

wispy spear
#

could that be overdraw in the bushes killing the perf?

wicked notch
#

it should be mitigated by hiz

wispy spear
#

maybe debug draw the bushes like potti did few months ago

#

frozen frustum and fly around the bush

#

to see if it do hiz or hiznt

wicked notch
#

oh you sparked an idea in me

#

page heatmap

#

brb

#

renderdoc is sometimes bogus though smh

frank sail
delicate rain
#

Why do we come up with the coolest of ideas when the article's deadline stares us down???

wispy spear
#

some people work better when there is pressure

delicate rain
#

I like to blame my weak discipline

wispy spear
#

out comes ze diamonds : )

wicked notch
#

there is a singularity at the center of bistro

#

idk what that is bleakekw

astral field
#

found the problem at least KEKW

frank sail
#

tbf that's where most of the detail actually is

wispy spear
#

explains all the bushes

#

they were trying to hide some shit

wicked notch
#

still something must be going very wrong

#

how come hiz isn't able to cope

wispy spear
#

is it possible that the geometry there is just simply fucked (read not ideal) and they really couldnt just be bothered

#

mayhaps needs some work in blender

wicked notch
#

nah I see now what's going on

wispy spear
#

ah

wicked notch
#

and yes it is the geometry being horrid

#

but it's something worse

#

big meshlets

#

foliage is a completely disconnected mesh and meshoptimizer already doesn't care about spatial locality

#

so it just grabs whatever triangles it has available to build the meshlets, this makes the AABB huge

#

hence hiz can't cope
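
A rough sketch of why a huge AABB defeats the test, assuming the common approach of picking the HiZ mip from the screen-space size of the bounds (the chat doesn't show the actual culling shader; `ScreenRect` and `HizMipForRect` are hypothetical names):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Screen-space bounds of a meshlet AABB, in pixels of HiZ mip 0.
struct ScreenRect {
    float minX, minY, maxX, maxY;
};

// Pick the mip at which the rect shrinks to roughly one texel in its
// larger dimension, so a small fixed number of fetches covers it.
int HizMipForRect(const ScreenRect& r, int mipCount) {
    assert(mipCount > 0);
    const float width = r.maxX - r.minX;
    const float height = r.maxY - r.minY;
    const float extent = std::max(std::max(width, height), 1.0f);
    const int mip = static_cast<int>(std::ceil(std::log2(extent)));
    return std::min(std::max(mip, 0), mipCount - 1);
}
```

A meshlet whose triangles are scattered across the foliage projects to a large rect, lands on a coarse mip, and gets compared against the farthest depth of a big screen region, so it almost never fails the test.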

wispy spear
#

that means you need a carefully crafted scene to properly generate meshlets for all the newfangledisms

wicked notch
#

..or

wispy spear
#

fix meshoptimizer

wicked notch
#

I do nanite

#

as I promised

wispy spear
#

you have to help joker to write the vsm paper first

wicked notch
#

yes I haven't done much bleakekw

#

I'll redeem myself

frank sail
#

you were never obligated to help with the writing

wicked notch
#

bullshit

#

it is my duty as a member of the VSM team to do work

astral field
wicked notch
#

I must uphold that duty

frank sail
#

I invite you to read what we have so far on overleaf

wicked notch
#

I shall

frank sail
#

feel free to add comments, make edits, etc.

#

obviously 😄

wicked notch
#

I'll note down stuff I think I can work on

#

we use tido yes?

frank sail
#

yeah

#

code isn't due for a few more weeks so we are not focusing on that at all

wicked notch
#

gud

delicate rain
wispy spear
#

looks like LVSTRI blogs bout his college setup

#

its an italian keyboard layout but he tinkered with some keys

wispy spear
#

lustri i also just noticed your cmake installing deps thingy, thats neat 🙂

#

also also, you dont need to tell cmake about your headers, cpp is enough, when declaring a target add_executable/library

wicked notch
#

I think I do target_sources everywhere, maybe there's some stray header

wispy spear
#

set(RETINA_HEADERS

#

and you still use cgltf, i will tell @loud crag

wicked notch
wispy spear
#

oh did you perchance remove it already :3

wicked notch
#

mayhaps

#

I don't remember ever having that but 🅱️erhaps my brain is failing me

#

as it usually does

wispy spear
#

oh

#

looks like i was on an old commit p_Cry

#

.Entry is still called .Entry though hehe

#

man your code is so readable

wicked notch
#

is it

#

I still need to make rg to separate the passes

#

rn everything is in application bleakekw

wispy spear
#

i fink it is, still not sure about the C for classes, but i very much stole the S and E prefix too now 🙂 and the auto Foo() -> ism

wicked notch
#

does the thing compile for you on lunix btw

#

I think new GCC should be out

wispy spear
#

ah let me check

#

uh let me check that too

#

last time was 13.2.1

#

it still is

wicked notch
#

hm perhaps it was just clang18

#

why is gcc so slow to release smh

wispy spear
#

dlss commit is outdated it seems

#
[cmake] fatal: Fetched in submodule path 'NVIDIAImageScaling', but it did not contain 35e13ba316c98eeecf16f37eae70ce88019911f6. Direct fetching of that commit failed.
[cmake] CMake Error at nvdia_dlss-subbuild/nvdia_dlss-populate-prefix/tmp/nvdia_dlss-populate-gitclone.cmake:62 (message):
#

it should be disabled on lunix anyway i suppose, perhaps with some autodetection and a message?

wicked notch
#

ye

wispy spear
#

trying clang 17.0.6

#

ah lol?

#

dlss cloning worked now

#

it cant find #include <vulkan/vk_enum_string_helper.h>

wicked notch
#

wot

wispy spear
#

hopefully not a new vksdk ;c

wicked notch
#

it's 275

#

but it's been there since the beginning of time I think

wispy spear
#

narf

#

im on 268

wicked notch
#

can you check in your sdk

#

is it actually missing

wispy spear
#

eh hang on

wicked notch
#

the string helper has been there since 1.2 at least

wispy spear
#

this is weird

#

vulkan/vulkan.hpp is coming from /usr/include/vulkan wtf

#
```
[deccer@rootfs ~]$ sudo pacman -R vulkan-devel
checking dependencies...
error: failed to prepare transaction (could not satisfy dependencies)
:: removing spirv-tools breaks dependency 'spirv-tools' required by glslang
:: removing vulkan-headers breaks dependency 'vulkan-headers' required by qt6-base
:: removing spirv-tools breaks dependency 'spirv-tools' required by shaderc
```
;C
wicked notch
#

vulkan-headers methinks

wispy spear
#

i have no memories installing anything qt6y

wicked notch
#

kde I think is qt

wispy spear
#

yeah, but im an xfce fanboy

#

ok let me fiddle

wicked notch
#

issa ok, I'll fix linux building once and for all now

wispy spear
#

haha

#

ah

#

its gnuplot

#

and that stupid patchpanel for thingy pipewire

#

and obs : )

#

removing ffmpeg breaks dependency 'ffmpeg' required by firefox

#

nice

wicked notch
#

don't nuke your os KEKW

wispy spear
#

: )

#

good thing is its ez to reinstall

#

but why would firefox have a hard dependency to ffmpeg

#

thats new

#

important question is, if i remove firefox and reinstall it, will it rember my current 250 open tabs 😛

#

ok i have to clean the tabs first anyway... will do that first

#

and i should also start working on local build containers

wispy spear
#

down to 150 tabs : >

wispy spear
#

down to 14

frank sail
#

how is that misinfo?

#

headers aren't treated as individual translation units

loud crag
#

IDEs or analyzer tools want that and might break or fail to analyse the headers if you dont add the headers

frank sail
#

perhap

wispy spear
#

had no trouble with clion or vscode-cpp so far at least

#

in my cmake/openglstarted and other cppisms for now

primal shadow
#

Finished revamping my 2pass occlusion culling to work with LODs, and be a lot cleaner/better in general!

#

Reusing the previous frame depth pyramid, instead of explicitly tracking cluster visibility between frames

loud crag
#

@wicked notch you know if nvpro_pyramid works well on non-nvidia gpus? i’m thinking of using it, too, but reading comments in the source like „subgroupSize other than 32 is not tested, should work, message [email protected] if not“ is throwing me off a little

#

perhaps ffx spd is a better fit for me

primal shadow
loud crag
#

huh, are you sure?

#

the nvpro dispatcher seems like it can handle anything that is a multiple of 4 by default

#

that number is templated, though, so perhaps it supports other factors, too

primal shadow
#

It probably doesn't enforce power of 2 for the textures or generates

#

Which would prevent you from using mips of a single texture

glass sphinx
#

maybe i misunderstand what you do

#

but if you use last frame's depth i assume you draw twice, once culled against last frame and once culled against the partial new frame

primal shadow
#

I'm doing exactly what nanite does

primal shadow
#

Started porting SPD to bevy. Non power of 2 is actually easy. The samples out of bounds will just return 0, which ends up ensuring conservative depth, at the cost of a poorer culling on the edges of the screen. Totally fine imo.
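
A small CPU sketch of that idea, assuming reversed-Z (0 is the far plane, so the pyramid keeps the minimum) and mip sizes rounded up so the last row and column actually read out of bounds. Both are assumptions for illustration, and `DepthMip`, `LoadOrZero` and `DownsampleMin` are hypothetical names, not SPD's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DepthMip {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height entries
};

// Out-of-bounds reads return 0, i.e. "infinitely far" under reversed-Z.
float LoadOrZero(const DepthMip& m, int x, int y) {
    if (x >= m.width || y >= m.height) return 0.0f;
    return m.texels[static_cast<std::size_t>(y) * m.width + x];
}

// One 2x2 min-reduction step of a depth pyramid for any (non power of 2) size.
DepthMip DownsampleMin(const DepthMip& src) {
    assert(src.width > 0 && src.height > 0);
    DepthMip dst{(src.width + 1) / 2, (src.height + 1) / 2, {}};
    dst.texels.reserve(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const float top = std::min(LoadOrZero(src, 2 * x, 2 * y),
                                       LoadOrZero(src, 2 * x + 1, 2 * y));
            const float bottom = std::min(LoadOrZero(src, 2 * x, 2 * y + 1),
                                          LoadOrZero(src, 2 * x + 1, 2 * y + 1));
            dst.texels.push_back(std::min(top, bottom));
        }
    }
    return dst;
}
```

Edge texels that touch an out-of-bounds read collapse to 0, so nothing behind them is ever culled: conservative, just less effective near the screen border.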

wicked notch
#

nvpro_pyramid does support NPOT

primal shadow
#

Almost finished with porting SPD!

loud crag
#

dlss be working beautifully

wicked notch
#

it's disabled

#

should probably do something to enable/disable it easily KEKW

wispy spear
#

you can use the already existing shortcut alt+f4 to reload the thing

buoyant summit
astral field
#

c++ classes froge_yeehaw

buoyant summit
#

slang has crappy classes

#

well it doesn't have classes, only structs, and those aren't crappy themselves, but the semantics for this in methods is hlsl-tier garbage

#

because this is basically inout

#

:barf:

astral field
#

make an issue gpAkkoShrug

primal shadow
#

Ok finished SPD port to Bevy/WGSL and using it in my nanite-impl 🙂

#

Much better perf

glass sphinx
astral field
#

since hlsl be hlsl

#

spirv is actually a this I think

glass sphinx
#

spirv doesnt have a concept of classes at all

astral field
#

not classes

#

but ptr's

#

glsl target also supports class stuff in slang

glass sphinx
#

does spirv have storage class - generic pointers?

buoyant summit
#

I don't like spir-v generic pointer tbh

#

or rather

#

I'm kinda suspicious about just throwing it into vk as is

astral field
#

they're good if good tooling could use them (vcc & slang one day)

#

although I don't know the extent to which they can be used (never checked up deeply), I only really know about restricted ptr's

glass sphinx
buoyant summit
#

and the generic pointers issue I opened a while ago is on a really slow cook

primal shadow
#

I'm hitting the max dispatch limit for culling workgroups 😬

#

memory also continues to be really annoying to allocate large amounts of

primal shadow
#
Caused by:
    In a ComputePass
      note: encoder = `<CommandBuffer-(3, 3, Vulkan)>`
    In a dispatch command, indirect:false
      note: compute pipeline = `meshlet_culling_first_pipeline`
    Each current dispatch group size dimension ([103039, 1, 1]) must be less or equal to 65536
#

I can do the same dumb trick I did for the other pass and make it a 3d dispatch I then remap to 1d in order to get more workgroups 😛

#

Or use bigger workgroup sizes ig. Currently it's 64x1x1.
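
The 3D-to-1D remap trick can be sketched like this, assuming a per-dimension limit of 65535 (the minimum Vulkan guarantees) and using only x and y. `SplitDispatch` and `LinearGroupIndex` are hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

struct DispatchDims {
    std::uint32_t x, y, z;
};

// Split `groupCount` workgroups into an x*y grid with each dimension within
// `maxPerDim`. The grid may contain a few extra groups that must early-exit.
DispatchDims SplitDispatch(std::uint32_t groupCount, std::uint32_t maxPerDim) {
    assert(maxPerDim > 0);
    if (groupCount <= maxPerDim) return {groupCount, 1u, 1u};
    const std::uint32_t y = (groupCount + maxPerDim - 1u) / maxPerDim;
    assert(y <= maxPerDim);  // otherwise a third dimension would be needed
    const std::uint32_t x = (groupCount + y - 1u) / y;
    return {x, y, 1u};
}

// What the shader would compute from its workgroup id and the grid width;
// groups whose linear index is >= groupCount do nothing.
std::uint32_t LinearGroupIndex(std::uint32_t idX, std::uint32_t idY,
                               std::uint32_t dimX) {
    return idY * dimX + idX;
}
```

For the 103039 groups in the error above this gives a 51520 x 2 grid, with a single padding group at the end.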

wheat haven
primal shadow
craggy shale
#

because actual magic doesn't exist and "magic" means heuristics and heuristics suck

wispy spear
#

im also confused about cmake 🙂

#

one cmakelists does if (RETINA_ENABLE_PROFILER) the other one does if (${RETINA_ENABLE_PROFILER})

wicked notch
#

I think it's equivalent

#

but ye I should change this

wispy spear
#

i cant get it to work either

#

i stole the profiler one

#

no matter what i set or option with ON and with or without CACHE and or CACHE BOOL it wont pick it up

#

unless i really have to yoink build/ physically and not just Delete Cache & Reconficture

loud crag
astral field
delicate rain
#

Mister mister @wicked notch please show your shadows

wicked notch
#

uh

#

are you in a hurry

#

gimme 20 mins

delicate rain
#

no

#

send gh

wicked notch
#

sure

#

jaker was able to compile with VS

delicate rain
#

Thank you froge_love

wicked notch
#

clang is a requirement

frank sail
#

you can install clang through the visual studio installer if you want to use VS

primal shadow
#

Behold: Wayy too many bunnies for my renderer to currently handle!

wicked notch
#

that occupancy is sad

primal shadow
#

Yes. Here's a breakdown of problems:

  • The two green vkCmdCopyBuffer()s (4ms each) are staging buffer copies. I'll probably want to use ReBAR and directly map buffers
  • The write_index_buffer pass (10.5ms each, 2 per view) are slooooow and have horrendous occupancy. Couple of things I can try:
    • Have culling pass write out a list of all visible clusters, indirect draw to spawn write_index_buffer pass workgroups, instead of spawning 1 wg per cluster and just using early exit if they were culled
    • Try to merge it with the culling pass
    • Add software raster and hope it reduces the need for hardware raster
    • Add mesh shaders to wgpu
  • Raster (2 per view, 3.3/4.3 for main/shadow view in the first one, and then second is basically free because occlusion culling)
    • I'm 83% PES+VPC throughput, which I think means primitive assembler limited? Nothing I can do to fix this really. Software raster + mesh shaders might help, again.
wicked notch
#

I don't remember our method for writing meshlet index buffers to be that slow tbh

#

idk how it came to be like this

primal shadow
wicked notch
#

looks fine

#

idk lol

#

what happens if you profile your shader with nsight

primal shadow
#

Hold on had to go walk somewhere. Give me 15m to get back to my PC.

#

I'm suspicious that the spawn 1 wg per cluster and early exit if culled is too slow

#

Because notice both are slow, first and second pass

#

Even though the second pass should be doing nothing

glass sphinx
#

like backface and frustum

#

also im pretty sure nanite uses no index buffer

#

writing it is just too slow

wicked notch
#

Jasmine doesn't as well

#

they use a regular non indexed draw

glass sphinx
#

what is the write index buffer doing then

primal shadow
frank sail
#

vpc = viewport culling

#

turning off back face culling won't help there

primal shadow
glass sphinx
#

yes it is

#

idk what your index buffer is

#

writing out the meshlets should be at most 1/10th of the draw time

primal shadow
glass sphinx
#

so your index buffer is really a primitive buffer i see

#

hmmm interesting that thats so slow

wicked notch
#

ye there should be no reason to be so slow

#

send the trace, I am curious

glass sphinx
#

yes

#

ah btw

#

remove the wg barrier

#

the atomic will be optimized by the hw if you are on nv

#

the barrier will most likely make it slower

primal shadow
#

Will do when I'm home, 5m

wicked notch
#

how does that work

glass sphinx
#

amd and nv both have heavy hw optimizations once you have extreme contention

#

so if all threads in a warp use the same address

#

it will catch that and do one atomic + warp prefix sum instead

#

instead of warp size n atomics

#

as soon as one thread diverges you die

wicked notch
#

in that case there is only one thread in the wg doing the atomic tho

glass sphinx
#

yea i believe this will be slower

#

the gain is lower than the cost of the barrier

#

so what im suggesting is doing the atomic in all threads and let the hw catch it and make it fast

#

this way you don't pay for the barrier

#

maybe that will make no difference tho

#

it can make a big difference

#

but its hit or miss

glass sphinx
#

this way you write way less

#

that was fast af

wicked notch
#

so what you're saying is that the driver just does this

```
const uint local_offset = subgroupExclusiveAdd(meshlet_primitive_count);
uint global_offset = 0;
if (subgroupElect()) {
  global_offset = atomicAdd(..., meshlet_primitive_count * 3) / 3;
}
```
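
For reference, a CPU simulation of the aggregated pattern, assuming every lane wants to reserve its own count from one shared counter: one exclusive prefix sum across the lanes, one atomic for the whole subgroup, then the base is shared with all lanes. `ReserveForSubgroup` is a hypothetical name and this only models the arithmetic, not what any particular GPU does internally:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the output offset for each lane of one subgroup while touching the
// shared counter exactly once.
std::vector<std::uint32_t> ReserveForSubgroup(
    std::atomic<std::uint32_t>& counter,
    const std::vector<std::uint32_t>& laneCounts) {
    assert(!laneCounts.empty());
    std::vector<std::uint32_t> offsets(laneCounts.size());
    std::uint32_t running = 0;
    for (std::size_t lane = 0; lane < laneCounts.size(); ++lane) {
        offsets[lane] = running;  // exclusive prefix sum over the lanes
        running += laneCounts[lane];
    }
    // The single atomic an elected lane would issue for the whole subgroup.
    const std::uint32_t base = counter.fetch_add(running);
    // Equivalent of broadcasting the base to every lane and adding it.
    for (std::uint32_t& offset : offsets) offset += base;
    return offsets;
}
```

Without the broadcast step only the elected lane would know the base offset, which is the part a hand-written shader version also needs.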
frank sail
#

shrimple as dat

glass sphinx
#

not the driver

#

its the actual hw

wicked notch
#

hardware accelerated subgroup memes

glass sphinx
#

its the same with branches btw

#

if all threads have same condition only one path is taken

#

there are more instructions like this where the hw will check if all threads use the same x to go faster

frank sail
#

the sc will perform optimizations like this btw

#

on AMD

glass sphinx
#

sc?

frank sail
#

the hardware also does stuff

glass sphinx
#

shader compiler

wicked notch
#

imagine trusting any compiler for glsl

glass sphinx
#

yes

#

but for runtime data its nicer to have the hw do it

frank sail
glass sphinx
#

so it can do it even if its unknown

primal shadow
#

Ok I'm back, let me see

primal shadow
glass sphinx
#

the mask masks tris in a meshlet

#

so one bit is one triangle

#

i had a limit of up to 64 tris per cluster so it was just 2 uints
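
A minimal sketch of such a mask, assuming at most 64 triangles per meshlet stored as two 32-bit words. `TriangleMask`, `MarkVisible` and `IsVisible` are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// One bit per triangle of a meshlet, 64 triangles max.
struct TriangleMask {
    std::uint32_t words[2] = {0u, 0u};
};

// Set by the culling pass for every triangle that survives.
void MarkVisible(TriangleMask& mask, std::uint32_t triangle) {
    assert(triangle < 64u);
    mask.words[triangle / 32u] |= 1u << (triangle % 32u);
}

// Queried at draw time to decide whether to emit the triangle.
bool IsVisible(const TriangleMask& mask, std::uint32_t triangle) {
    assert(triangle < 64u);
    return ((mask.words[triangle / 32u] >> (triangle % 32u)) & 1u) != 0u;
}
```

Compared with writing three indices per surviving triangle, the culling pass only writes 8 bytes per meshlet.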

primal shadow
# glass sphinx remove the wg barrier

Here?

    // Reserve space in the buffer for this meshlet's triangles, and broadcast the start of that slice to all threads
    if triangle_id == 0u {
        draw_index_buffer_start_workgroup = atomicAdd(&draw_indirect_args.vertex_count, meshlet.triangle_count * 3u);
        draw_index_buffer_start_workgroup /= 3u;
    }
    workgroupBarrier();

How is removing it safe? Other threads may read the value before thread 0 writes it no?

wicked notch
#

I wanted to try it but I then just hopped on the mesh shader train KEKW

glass sphinx
#

i did it yes

#

i did all methods that i know

wicked notch
#

and the tri mask was the best?

glass sphinx
#

on a 4080

wicked notch
#

well

#

ye I would imagine blasting 100 million vertices for a 4080 is just another tuesday KEKW

glass sphinx
#

well, on a 1080ti it was kinda the same

wicked notch
#

hmm interesting

#

you just write invalid vertices if the tri is masked right?

glass sphinx
#

yea

#

i just send them to -2-2-2

primal shadow
# glass sphinx i had a limit of up to 64 tris per cluster so it was just 2 uints

So instead of writing a buffer of cluster|triangle IDs so each vertex invocation knows what data to fetch, you wrote out a list of just cluster IDs (1 per cluster), and then hardcoded the draw size to 64 * clusters? And then each vertex can find its cluster via vertex_id % 64, and then you just output NaN for excess triangles?
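
One plausible way to index such a fixed-size draw, assuming a non-indexed draw of 64 * 3 vertices per visible cluster. The exact arithmetic is an assumption (it uses a division for the cluster rather than the modulo mentioned above), and `DecodeVertexId` and `IsPadding` are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMaxTrianglesPerCluster = 64u;
constexpr std::uint32_t kVerticesPerCluster = kMaxTrianglesPerCluster * 3u;

struct VertexSource {
    std::uint32_t clusterSlot;  // index into the list of visible cluster IDs
    std::uint32_t triangle;     // 0..63 within the cluster
    std::uint32_t corner;       // 0..2 within the triangle
};

// Recover which cluster, triangle and corner a vertex invocation belongs to.
VertexSource DecodeVertexId(std::uint32_t vertexId) {
    return {vertexId / kVerticesPerCluster,
            (vertexId % kVerticesPerCluster) / 3u,
            vertexId % 3u};
}

// True when the cluster has fewer triangles than the fixed slot count, so the
// vertex shader should emit a position that gets discarded (NaN or off-screen).
bool IsPadding(const VertexSource& v, std::uint32_t clusterTriangleCount) {
    assert(clusterTriangleCount <= kMaxTrianglesPerCluster);
    return v.triangle >= clusterTriangleCount;
}
```

The trade-off is the one raised in this thread: no per-triangle buffer writes in compute, but every cluster pays for all 192 vertex invocations.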

glass sphinx
#

i should have used nan as amd fast paths those

primal shadow
#

I tried that. It was slow...

glass sphinx
#

what was slow

primal shadow
#

The draws. All the extra vertex invocations were expensive.

glass sphinx
#

yea but now your compute is slow

primal shadow
#

hrmm maybe it's worth a second test...