#Luna Engine - C++ and Vulkan
2381 messages · Page 3 of 3 (latest)
and the shader doesn't even use local invocation ID
it only uses global
void main() {
const uint meshletId = gl_GlobalInvocationID.x + (BatchID * MeshletsPerBatch);
if (meshletId >= Uniforms.MeshletCount) { return; }
bool isVisible = CullMeshletFrustum(meshletId);
if (isVisible) {
const uint index = atomicAdd(CullTriangleDispatch.Commands[BatchID].x, 1);
VisibleMeshlets.Indices[index + (BatchID * MeshletsPerBatch)] = meshletId;
}
}
yeah I have no idea why changing the workgroup size fixed it
what is the default if you don't give one, anyway? does it default to 1, or some device-specific amount?
1
and atomic adds work across invocations / outside of local group?
yes
then I guess it really must be some kind of intel skill issue, idk how else to explain it
tfw CullMeshlets is outputting more meshlets than were given to it 
it's just predicting the future for you already
it's got to be something about the atomic add
do I need a barrier() somewhere..?
if CullMeshlets has a local size of 1, there's no error, but I'm back to missing half of my meshlets
if it's anything above 1, the meshlets appear, but I start apparently counting meshlets twice
damnit, just needed to adjust the bounds check
so, setting the local size of cullmeshlets didn't actually fix the issue, it just made it look better because it was going outside the bounds of its own batch
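A CPU-side sketch of the kind of bounds check being described, assuming each batch owns MeshletsPerBatch consecutive meshlet IDs and the dispatch can be rounded up past the end of a batch. The actual fix isn't shown in the chat, so the helper name and the per-batch guard are assumptions:

```cpp
#include <cstdint>

// Hypothetical helper mirroring the shader's early-out. `invocation` plays the
// role of gl_GlobalInvocationID.x within one batch's dispatch.
constexpr bool ShouldProcessMeshlet(std::uint32_t invocation,
                                    std::uint32_t batchId,
                                    std::uint32_t meshletsPerBatch,
                                    std::uint32_t meshletCount) {
  // Guard 1: with a local size above 1 the dispatch may be rounded up, so
  // invocations past the end of this batch must not spill into the next one.
  if (invocation >= meshletsPerBatch) { return false; }
  // Guard 2: the last batch may be only partially filled.
  const std::uint32_t meshletId = invocation + batchId * meshletsPerBatch;
  return meshletId < meshletCount;
}
```

Checking only the global meshlet count (guard 2) would let a rounded-up batch process meshlets that belong to the next batch, which would match the "counting meshlets twice" symptom.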
still have the problem of other batches not appearing, except in renderdoc
oh my god
it's...the validation layers
somehow disabling validation makes it all work
this thing right here is the cause of all my headache, what the hell
in any case, meshletisms seem to be complete
now I guess I need to learn what this "hi-z" nonsense is
: ) cool
So do you think it's a bug in the layer
it's either the layer or the layer confusing Intel
that one specific checkbox causes and fixes the issue
looking at the layer code I can't see anything immediately wrong
also why don't we have a vkCmdDispatchIndirectCount damnit
home for the weekend, now I get to try all this fun on my 3080
first thing I should probably do is get some timestamp queries
yes
pixel says nobody wanted it
well I do
and that's enough for all vendors to bend over and impl it
but also potrick and now you want it
I demand justice
how expensive is it to have a dispatch that does nothing but early return
I consolidated the entire cullmeshlets/culltriangles compute into 2 calls, but it means that at least half of the invocations just bail out
is it better to have 4 separate but properly-sized indirect calls?
idk but I'd assume the less dispatches the better
you can merge the meshlet culling and primitive into one dispatch too
I guess I could, couldn't I? hmm
this is why I need timestamps, I need answers only profiling can give
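For reference, a small sketch of turning two raw timestamp query results into milliseconds. The helper name is made up; the two inputs it relies on are VkPhysicalDeviceLimits::timestampPeriod (nanoseconds per tick) and VkQueueFamilyProperties::timestampValidBits:

```cpp
#include <cstdint>

// Hypothetical helper: converts a pair of raw timestamp query results to
// milliseconds. `timestampPeriod` is nanoseconds per tick, `validBits` is the
// number of meaningful low bits in each timestamp for the queue family used.
inline double TimestampDeltaMs(std::uint64_t begin, std::uint64_t end,
                               float timestampPeriod, std::uint32_t validBits) {
  // Masking to the valid bits also makes the subtraction come out right if the
  // counter wrapped around once between the two queries.
  const std::uint64_t mask =
      validBits >= 64 ? ~0ull : ((1ull << validBits) - 1ull);
  const std::uint64_t ticks = (end - begin) & mask;
  return static_cast<double>(ticks) * static_cast<double>(timestampPeriod) * 1e-6;
}
```

A 50us pass on a device with a 1ns period shows up as a difference of 50000 ticks.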
I can already tell you meshlet gen for bistro (interior and exterior) takes 50us on my 3070
gen? as in meshopt_buildMeshlets?
the compute shader that outputs index buffer
ah, do you have a single compute shader?
in frogfood yes
frogfood has 2 tho
there's a new thing?
the new thing is mesh shaders
yeah jaker split it at some point
when he wanted primitive culling
but you can merge both as well
I did so before moving to mesh shaders
yeah I should definitely try mesh shaders, but it would mean I'd have to maintain both paths or abandon developing at work XD
the best solution is mesh shaders ofc but this approach is great too tbh
vkAllocateMemory(): pAllocateInfo->allocationSize is 359013904 bytes from heap 2,but size of that heap is only 224395264 bytes.
does VMA just...not look at heap sizes before allocating? I didn't tell it to use any specific memory type, it chose that one
single compute shader has been accomplished, with a small feedback buffer for culling stats
tried adding cone culling for fun...yep, 3% of the meshlets for the whole scene. def not worth much
probably not worth the extra memory bandwidth at all
How are you getting the culling numbers? Readback?
ye, small storage buffer
layout(set = 1, binding = 9, scalar) restrict buffer VisBufferStatsBuffer {
VisBufferStats Stats;
};
timestamps are in, maybe implot next, but that's enough for tonight
do bruteforce backface culling
zero reason to use cone culling with software meshlets
I already have the determinant part from culltriangles yeah
I was just curious if it would help to cull them in bulk before doing individuals
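A sketch of the per-triangle "determinant part" mentioned above, done on clip-space (x, y, w). The struct and function names are illustrative, and which sign counts as back-facing depends on winding order and viewport orientation, so the `<= 0` test is an assumption:

```cpp
struct ClipVertex { float x, y, w; };  // clip-space x, y, w (z is not needed)

// Determinant of the 3x3 matrix whose rows are the vertices' (x, y, w),
// expanded along the first row.
constexpr float ClipDeterminant(const ClipVertex& a, const ClipVertex& b,
                                const ClipVertex& c) {
  return a.x * (b.y * c.w - b.w * c.y)
       - a.y * (b.x * c.w - b.w * c.x)
       + a.w * (b.x * c.y - b.y * c.x);
}

// Assumed convention: counter-clockwise triangles are front-facing and give a
// positive determinant; zero means the triangle is edge-on (degenerate).
constexpr bool IsBackFacing(const ClipVertex& a, const ClipVertex& b,
                            const ClipVertex& c) {
  return ClipDeterminant(a, b, c) <= 0.0f;
}
```

Because the test works in homogeneous coordinates, it needs no perspective divide before deciding.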
So if I'm understanding Hi-Z right, you're basically taking mips of the depth buffer where the value is the closest depth value for that section, so you can prevent drawing objects behind closer geometry?
How well does that work when the camera moves quickly/on the edges of the screen?
What's two pass?
render previous visible objects
build hzb
render disoccluded objects
yet there are three steps, curious
How do you determine the disoccluded objects?
they are the ones that were not rendered in step 1, but were not culled
you test for visibility against hzb and if previous visibility was 0 and current visibility is 1 then it's a disocclusion event
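The three steps and the disocclusion rule can be written down as plain predicates. This is only a sketch of the logic as described in the chat; the function names are made up and frustum culling is left out:

```cpp
// Pass 1: draw whatever was visible last frame; this depth feeds the HZB build.
constexpr bool DrawInFirstPass(bool visibleLastFrame) {
  return visibleLastFrame;
}

// Pass 2: after the HZB is built, draw objects that pass the HZB test now but
// were not drawn in pass 1 (previous visibility 0, current visibility 1).
constexpr bool DrawInSecondPass(bool visibleLastFrame, bool passesHzbNow) {
  return !visibleLastFrame && passesHzbNow;
}

// The result of this frame's HZB test becomes next frame's "previous" flag.
constexpr bool NextFrameVisibility(bool passesHzbNow) {
  return passesHzbNow;
}
```

An object that was visible last frame but fails the test now still gets drawn in pass 1; it simply drops out of the first pass on the following frame.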
Is it a hard requirement that hzb images are power-of-two?
or could I just make a normal mip chain
you'd have to cope with downsampling odd resolutions but it could work
it's just easier to reason about when it's PoT tbh
hmm, may have to mess with my render graph
I can make images that are constant size or a multiplier to framebuffer size, but not "previous power of two of framebuffer size"
I guess I can make "Fixed", "Relative", and "PoTRelative"?
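For the "previous power of two of framebuffer size" case, a minimal sketch of the sizing math. Extent2D and the function names here are placeholders, not Luna's render graph API:

```cpp
#include <cstdint>

struct Extent2D { std::uint32_t width; std::uint32_t height; };

// Largest power of two that is <= v (returns 0 for 0).
constexpr std::uint32_t PreviousPow2(std::uint32_t v) {
  std::uint32_t r = 1;
  while (r <= v / 2) { r *= 2; }
  return v == 0 ? 0 : r;
}

// Size of HZB mip 0 for a given framebuffer size.
constexpr Extent2D HzbExtent(Extent2D framebuffer) {
  return Extent2D{PreviousPow2(framebuffer.width), PreviousPow2(framebuffer.height)};
}

// Number of mips needed to reduce the larger dimension down to 1 texel.
constexpr std::uint32_t HzbMipCount(Extent2D extent) {
  std::uint32_t largest = extent.width > extent.height ? extent.width : extent.height;
  std::uint32_t levels = 0;
  while (largest > 0) { ++levels; largest /= 2; }
  return levels;
}
```

A 1600x900 framebuffer would give a 1024x512 HZB with 11 mips under this scheme.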
why not let people get the framebuffer size
because technically the framebuffer size doesn't exist until the graph is executed
actually wait, hmm
You could provide some kind of callback to get this
It could reduce API surface area since you could get rid of the other options
Or keep them if they simplify things, idc
I forgot framebuffer size is baked into the graph, passes CAN retrieve the size of other attachments at bake time, that works
I think I need to ditch this render graph for now, it's...putting up a lot of resistance to hzb
Does this seem right to you guys? Not sure if my hzb reduce is acting as it should
Mip 0:
Mip 4:
shouldn't it always be getting brighter (higher values)?
yeah
then yes
yes it's wrong?
whenever i see or read idk i hear this https://youtu.be/8X9jEej1Ttg?t=3 (until 8th second)
yes it should be getting brighter
@prisma folio I'm stupid
with reverse Z you do min reduction
therefore it's not supposed to get brighter, sorry
then I'm confused, what do the mips actually do?
if we do min all the time, we're checking what the furthest depth value in that region is, no?
actually wait, now that I'm thinking, hmm
you calculate how large the screen space bounding box of the object is, then sample a prefiltered mip based on that
otherwise you would have to sample a lot of pixels for large objects in order to determine whether they can be culled
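A CPU sketch of one reduction step, assuming reverse Z (near = 1, far = 0) and a power-of-two source. Each destination texel keeps the minimum, i.e. the farthest depth in its 2x2 footprint, so an object can only be culled if it is behind everything in that region. The function name is made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Builds the next HZB mip from `src` (row-major, srcWidth x srcHeight).
std::vector<float> ReduceMin(const std::vector<float>& src,
                             std::size_t srcWidth, std::size_t srcHeight) {
  const std::size_t dstWidth = std::max<std::size_t>(srcWidth / 2, 1);
  const std::size_t dstHeight = std::max<std::size_t>(srcHeight / 2, 1);
  std::vector<float> dst(dstWidth * dstHeight);
  for (std::size_t y = 0; y < dstHeight; ++y) {
    for (std::size_t x = 0; x < dstWidth; ++x) {
      // Clamp so a 1-texel-wide or 1-texel-tall source still reads valid texels.
      const std::size_t x0 = std::min(2 * x, srcWidth - 1);
      const std::size_t x1 = std::min(2 * x + 1, srcWidth - 1);
      const std::size_t y0 = std::min(2 * y, srcHeight - 1);
      const std::size_t y1 = std::min(2 * y + 1, srcHeight - 1);
      dst[y * dstWidth + x] = std::min(
          std::min(src[y0 * srcWidth + x0], src[y0 * srcWidth + x1]),
          std::min(src[y1 * srcWidth + x0], src[y1 * srcWidth + x1]));
    }
  }
  return dst;
}
```

With reverse Z the higher mips therefore get darker or stay the same, never brighter.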
well hzb is certainly doing...something
I feel like there has to be some kind of Y-flip that I'm not doing, because the way this screenshot looks is as if the ceiling is a mirrored mask for the meshlets on the floor (notice the gap in the ceiling matching below)
if you copy from fwog then probably
I've tried flipping so many Ys and can't figure out which one it should be
hehe
U should write a debug visualization instead of flipping the table Y
I'm actually trying something now
drawing a box that represents the uv bounds of a specific meshlet
problem is, it looks perfect
which narrows the issue to CullQuadHiZ
jesus, I think I've finally got it
meshlet cull is already up to 1ms though which is interesting, didn't you say your cull only took 50us LVSTRI?
yeah my cull is almost instant
I can do it 10+ times per frame without noticeable perf loss
I'mma grab nsight, I don't trust my timestamps yet
aaaand it crashes allocating a command buffer, nice
yes culling is a noop for me
3070
including hzb?
nsight said 0.41ms for my whole dispatch
sounds about right
does it? 0.41ms isn't exactly a noop
there may or may not be some more optimizations you can do
idk how you implemented the thingy
I never realized that the interior is a little, uh..off
I assume it's just because the interior and exterior were never meant to be merged
hmm, slight issue with hzb; not sure if this is because it's picking the wrong mip or what
that vertical plank in front of the bottles gets culled
back on the intel igpu, oof
https://github.com/Eearslya/Luna/blob/dev/Resources/Shaders/CullMeshlets.comp.glsl Any easy performance wins anyone can spot? It's basically the same shader as Frogfood except meshlets/triangles combined into one. I imagine a lot of time is spent at the group barriers.
Are you on Intel
right now yeah, but even on the 3080 it was a good deal slower than what LVSTRI mentioned
The fact that 99% of threads are doing nothing while you cull meshlets is not ideal
I split the two shaders specifically to make it easier to achieve peak occupancy
I mean you could use persistent threads to have constant max occupancy in one dispatch, but that'd be hard
And you'd have to use a sketchy ahh spinlock
hmm, guess I can try splitting them again as an experiment
when LVSTRI said 50us I figured splitting them wasn't all that important, but I guess it might be
culling was slow for me until I split them
with validation turned off it's currently ~11ms
I should also warn you about copying that primitive culling code, because I don't think it's 100% correct 
I'm mainly skeptical about the small primitive culling function
now that I think about it, I think this issue is caused by going straight from framebuffer size to previous power of 2
some pixels end up getting skipped so the hiz buffer isn't entirely accurate
changing it to start at framebufferSize / 2 causes weird artifacts, it doesn't like not being pot
dunno how else to fix it other than going to next power of two
which would make it a 2048/1024 image
if you're going from the vkguide version, there's an off by 1 on picking the mip level to sample
try adding 1 to the calculated mip level in the cull shader
+1 seems to have fixed the issues I had, and min reduction means I can have just 1 sample, neat
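A sketch of the mip selection with the +1 applied. The floor(log2(...)) form is my assumption about how the level was being computed before the fix. The inputs are the projected bounds measured in mip 0 texels, and the helper name is made up:

```cpp
#include <algorithm>
#include <cmath>

// Picks the HZB mip to sample for a screen-space rectangle.
inline int HzbMipLevel(float widthTexels, float heightTexels, int mipCount) {
  const float extent = std::max({widthTexels, heightTexels, 1.0f});
  // floor(log2(extent)) leaves the rectangle between 1 and 2 texels wide at
  // that level, which can straddle up to 3 texels per axis. One level higher
  // it is under 1 texel wide, so it touches at most a 2x2 block.
  const int level = static_cast<int>(std::floor(std::log2(extent))) + 1;
  return std::min(level, mipCount - 1);
}
```

A 2x2 block is exactly what one bilinear-footprint sample covers, which is presumably why a min reduction sampler gets away with a single tap.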
yeah that +1 really threw me for a loop back when I tried it, hence why I remember the solution offhand lol
yeah it didn't show up all the time but it came up whenever there was something right up against the camera with a small gap to look through
looks pretty good for overdraw
yeah I remember what made it infuriating was that it almost worked right
I ended up drawing debug spheres and overlaying sub-quads of the depth buffer for where the AABBs projected, not even sure how I ultimately fixed it though
all of this descriptor chat in #vulkan has me thinking again, about scrapping the entire descriptor set setup from granite and going with a daxa "just bind everything ever" approach
my plan for #1128020727380054046 is to make a single beeg descriptor set with a shrimple allocator
iirc that's basically what daxa does, the CPU handle for buffers and images is just an index into the descriptor array, so the CPU and GPU "handles" are the same
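A minimal sketch of what a simple slot allocator for one big descriptor array could look like, where the returned index doubles as the GPU-side handle. The class is hypothetical and not taken from daxa or Luna:

```cpp
#include <cstdint>
#include <vector>

// Hands out indices into a fixed-size descriptor array and recycles freed ones.
class SlotAllocator {
 public:
  static constexpr std::uint32_t Invalid = 0xFFFFFFFFu;

  explicit SlotAllocator(std::uint32_t capacity) : _capacity(capacity) {}

  // Returns a free slot index, or Invalid when the array is full.
  std::uint32_t Allocate() {
    if (!_free.empty()) {
      const std::uint32_t slot = _free.back();
      _free.pop_back();
      return slot;
    }
    if (_next < _capacity) { return _next++; }
    return Invalid;
  }

  // Assumption: the caller only frees a slot once the GPU is done with it,
  // e.g. by deferring the call for a few frames.
  void Free(std::uint32_t slot) { _free.push_back(slot); }

 private:
  std::uint32_t _capacity = 0;
  std::uint32_t _next = 0;
  std::vector<std::uint32_t> _free;
};
```

The index returned by Allocate() is what would get written into the descriptor array and later sent to shaders, e.g. through push constants.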
I'm still considering how BDA fits in to all this
I guess it means I can make an array of buffer pointers and put that (along with the other descriptor stuff) in a header and include that everywhere
without infecting every shader source with struct definitions I mean
yeah I was thinking either 1 storage buffer with an array of pointers or a push constant with a BDA to said storage buffer
you could have the push constant hold a pointer to a buffer with an array of buffer references
exactly
layout(scalar, binding = DAXA_BUFFER_DEVICE_ADDRESS_BUFFER_BINDING, set = 0) restrict readonly buffer daxa_BufferDeviceAddressBufferBlock { daxa_u64 addresses[]; }
daxa_buffer_device_address_buffer;
daxa apparently goes with the "one storage buffer" approach
#define DAXA_STORAGE_BUFFER_BINDING 0
#define DAXA_STORAGE_IMAGE_BINDING 1
#define DAXA_SAMPLED_IMAGE_BINDING 2
#define DAXA_SAMPLER_BINDING 3
#define DAXA_BUFFER_DEVICE_ADDRESS_BUFFER_BINDING 4
#define DAXA_ACCELERATION_STRUCTURE_BINDING 5
one master descriptor set with 6 bindings
you need only 3 btw
but yes this method gud
I can vouch for it
since I stole it from potrick 
actually, you need just one
https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_VALVE_mutable_descriptor_type.html
btw, if you have a fat array of buffer addresses, how do you tell each shader where the buffers it cares about are?
do you just send some push constants with indices
or perhaps just one buffer pointer with an array of indices
yes
you send ids through push constants
probably a push constant with an index to a "master" buffer that has other indices
finding the master buffer isn't the (imagined) problem
ok ye what lvstri showed is what I was wondering
yeah
huge array of addresses, then you tell the shader where X buffer is and it will fetch and reinterpret the pointer
yeah, you just have the same huge array of pointers defined N times with N different types
ye
no actually
the buffer with all the addresses is declared once
as array of uint64_t
you then reinterpret
oh wait I thought you meant this is the alternative
this is what I meant
alias all the images with all the qualifiers etc
since this is time consuming you:
A. automate this and generate all the macro permutations
B. define only a subset of aliases and let the user choose further aliasing with restrict/readonly/...
I recommend B
the buffer with the addresses is defined multiple times, no? each different type has its own layout(set = 0, binding = 0) of the same buffer
nope, just once
storage_images_restrict_coherent_volatile_readonly_writeonly[i] 
you don't need to alias the address buffer
since you're gonna be reinterpreting later, with BDA
I'm confused because that's exactly what the code you linked is doing though
oh
yeah, I should've used the daxa example
Potrick does it better than I do
I'm describing what potrick does
What I'm doing is I'm aliasing storage buffers directly
because I made a grave mistake
But you shan't commit the same mistake, do it the good™ way
I'll refactor this later 
where is this defined RetinaGetStorageBufferMember
Bindings.glsl
I'm trying to read daxa code but I can't find any actual shaders, just the includes
you should probably look at daxa's "daxa.glsl" for the correct way to do this
how does one reinterpret a buffer
you declare the buffer type with buffer_reference
and then you do this "Type name = Type(masterAddressBuffer[id])"
Like a functional style cast
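A CPU-side analogy of the same idea, with host pointers standing in for buffer device addresses. On the GPU the addresses would come from vkGetBufferDeviceAddress and the cast would be the buffer_reference constructor. MeshletData and the helper names are made up for illustration:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical payload type; stands in for a buffer_reference block.
struct MeshletData {
  std::uint32_t vertexOffset;
  std::uint32_t triangleCount;
};

// Stores a pointer as a raw 64-bit address, like an entry in the address buffer.
template <typename T>
std::uint64_t ToAddress(T* pointer) {
  return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(pointer));
}

// The address table is declared once as untyped 64-bit values; the type is
// chosen at the point of use, like Type(masterAddressBuffer[id]) in GLSL.
template <typename T>
T* FromAddress(const std::vector<std::uint64_t>& addresses, std::uint32_t id) {
  return reinterpret_cast<T*>(static_cast<std::uintptr_t>(addresses[id]));
}
```

The table itself never needs aliasing per type; only the cast at the use site knows what the address points to.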
descriptor buffer is nicer than the horrid descriptorpool/set thingy
but thankfully you write this once and forget about it forever
so it don't matter much
yeah you really only need 1 pool/set for this anyway
ye
also thinking about nuking buffer usage from my api and just setting it all
dew it
every day I stray further from granite
I notice none of the code accessing buffers/images has nonuniformEXT, isn't that needed?
like in rchit when it's getting textures
okay cool just making sure I wasn't missing something
hmmm, starting to suspect something's up with the occlusion cull. no meshlets fail the hi-z, even at this angle?
Random thoughts to investigate:
- Singletons vs static classes vs namespaced loose functions
- Custom memory allocator? Plus containers that can use it
- Windowing system that supports 1 main window + any number of child windows (imgui viewports)
- Separate render thread
- Object creation: device->CreateTexture() vs Texture::Create()
- Shared GLSL/C++ headers
on the first point, might as well add dependency injection too, could come in handy later when object creation goes through device->Foo
yeah I'm just trying to think of where to put them; I don't want the shaders to be in my cpp folders because where will they go if I built for distro
I just put them next to my shaders 
that way it's easy for the scuffed shader includers to find them
and on the c++ side I just have to add the shader folder to the list of includes
thats a build system issue not a file organization issue
everyone's doing cool shadow stuff but me, I need to start doing things again
usually it's deccer who tries to resuscitate me
behold a project, this time using a lot of Retina-isms to make the code cleaner @lone moon
been at this for the past week or so, bikeshedding what kind of code style I wanted
you have discord timestamps in milliseconds?
no but log entries my frog
ah
time for me to write nothing vulkan-related for the rest of the day
I need filesystem access
how to sue discord people
oh shit this isn't google
how to delete discord message
jokes aside I'm stoked that my code served as inspiration for you
how I'll sleep tonight knowing that at least one (1) person found it useful:

Ye, it's a bit of a different paradigm from what I'm used to (all of the static Create functions were weird to me) but it's kinda grown on me
template<typename T>
class MyClass {
public:
T* Thing = nullptr;
[[nodiscard]] constexpr auto operator*() const noexcept -> T& { return *Thing; }
};
Question for the C++ spec experts, would this operator automatically switch between returning T& and const T& depending on the const-ness of the class?
Or do I need a separate declaration for both?
I feel like using decltype(auto) would make it just return a copy though
auto may drop cv qualifiers i read somewhere few weeks ago when i was wondering what decltype was
decltype(auto) preserves that and reference-ness
<source>:16:24: error: static assertion failed
16 | static_assert(std::is_same_v<decltype(*c2), const int&>);
rip
decltype(auto) doesn't save it either
shiat
guess I do need separate definitions
interestingly I am wrong about this one though, decltype(auto) does make it return T& and not just T
if I don't have a postfix return type at all it does return T, interesting
returning auto is always returning value
LUNA_NODISCARD constexpr auto Get() noexcept -> T*;
LUNA_NODISCARD constexpr auto Get() const noexcept -> const T*;
LUNA_NODISCARD constexpr auto operator*() noexcept -> T&;
LUNA_NODISCARD constexpr auto operator*() const noexcept -> const T&;
LUNA_NODISCARD constexpr auto operator->() noexcept -> T*;
LUNA_NODISCARD constexpr auto operator->() const noexcept -> const T*;
mildly annoying but oh well
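A compilable version of the two-overload approach, with static_asserts showing that the const-ness of the wrapper now picks the return type. Handle is a stand-in name, not the actual Luna class, and this assumes C++17:

```cpp
#include <type_traits>
#include <utility>

template <typename T>
class Handle {
 public:
  explicit constexpr Handle(T* thing) noexcept : _thing(thing) {}

  // Non-const wrapper: mutable access.
  [[nodiscard]] constexpr auto operator*() noexcept -> T& { return *_thing; }
  // Const wrapper: read-only access. The const has to be spelled out here,
  // because a const Handle only makes the pointer const, not the pointee.
  [[nodiscard]] constexpr auto operator*() const noexcept -> const T& { return *_thing; }

 private:
  T* _thing = nullptr;
};

static_assert(std::is_same_v<decltype(*std::declval<Handle<int>&>()), int&>);
static_assert(std::is_same_v<decltype(*std::declval<const Handle<int>&>()), const int&>);
```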
you'd do auto& for ref
why is that different from returning decltype(auto) tho
oooh
decltype(auto) forwards whatever is returned by the function
and forwarding is always done by ref?
or does the builtin dereference operator just return references
not sure
probably builtin dereference returns lvalue reference
you can get rid of it with std::remove_reference
int* a;
static_assert(std::is_same_v<decltype(*a), int&>);
looks like dereferencing a pointer just gives a reference, TIL
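To summarize what the three spellings deduce when the body is `return *Thing;`, here is a small check (Holder is a made-up type, C++17 assumed). Note that none of them adds const, even in a const member function:

```cpp
#include <type_traits>
#include <utility>

struct Holder {
  int* Thing = nullptr;

  // Plain auto deduces like a by-value parameter: reference dropped, returns int.
  auto ByAuto() const { return *Thing; }
  // auto& deduces int and adds the reference: returns int&.
  auto& ByAutoRef() const { return *Thing; }
  // decltype(auto) uses decltype(*Thing), and dereferencing a pointer is an
  // lvalue expression, so this returns int&.
  decltype(auto) ByDecltypeAuto() const { return *Thing; }
};

static_assert(std::is_same_v<decltype(std::declval<const Holder&>().ByAuto()), int>);
static_assert(std::is_same_v<decltype(std::declval<const Holder&>().ByAutoRef()), int&>);
static_assert(std::is_same_v<decltype(std::declval<const Holder&>().ByDecltypeAuto()), int&>);
```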
never had to think about these kinds of semantics before
about this question itself, probably not, because T* is not a pointer to const T, so when you dereference it, it's not const
class being const would mean that the member is a const pointer to non const T
though if T is const T then it should work I think
I am ignorant, is const not just a compile time signal to the compiler to error on attempts to write to something?
why would it have any impact on pointers and references at execution times if so
because it lets compiler do more optimizations knowing that something doesn't change
ahh that makes sense
it is even UB to cast away constness and modify through something that was created as const
yeah that makes sense actually, given the optimizations, it's like you promised the compiler something and then broke your promise after it did some things based on that promise
maybe with a unity build it can figure those things out?
shouldn't be any different from link time optimizations
I think ReSharper C++ warned me about const issues, though I wasn't working on a template
and yes compiler can analyze code to decide which transformations to apply but it's less reliable than being explicitly const correct
const-correctness is also the way to go because it signals to developers what they should and shouldn't do
yeah should make use of the syntax to clear up semantics wherever possible, they even made a language all about it it's called rust
we don't need to throw out the baby with the bathwater
very apt phrase in this context I think
Yes
It's only not UB if the original object was not const
yes
Oh I switched your first two words in my mind and thought you were asking a question
yes
void DoThing(const int& a) { const_cast<int&>(a) = 5; }
int a = 1;
DoThing(a);
you mean this isn't UB?
Ye
that's bizarre to me but alrighty
it was created not const so it makes sense it's not const even if you slap const on it
I think that's basically why const cast exists to begin with
I thought it existed because there are old C APIs that have no notion of const correctness
I guess that makes sense, I figured const cast was for passing objects to functions that weren't written with const even if they don't modify the object
yeah basically that
Yeah
I thought const_cast was basically "Take away the const so I can pass it to this function, but I promise I won't actually modify it"
That's how it should be used
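Putting the two cases side by side, assuming a hypothetical legacy function that takes a non-const pointer but only reads through it:

```cpp
#include <cstddef>
#include <cstring>

inline void DoThing(const int& a) { const_cast<int&>(a) = 5; }

// Well-defined: the object itself was created non-const, the const was only
// on the reference used to pass it.
inline int ModifyNonConst() {
  int a = 1;
  DoThing(a);
  return a;
}

// Undefined behavior if uncommented: `b` was created const.
// const int b = 1;
// DoThing(b);

// Hypothetical legacy API with no const in its signature; it only reads.
inline std::size_t LegacyLength(char* text) { return std::strlen(text); }

// The intended use of const_cast: drop const to satisfy the old signature,
// with the promise that nothing is written through the pointer.
inline std::size_t Length(const char* text) {
  return LegacyLength(const_cast<char*>(text));
}
```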
what are the rules for where "begin" is
Hm?
if I am assigning a literal value for the first time to a const variable it is clear that this value was originally declared as const
but
sometimes we get data from places other than literal assignment
what if I'm reading from io
what if the value was derived from a library
it doesn't matter what you initialize the object with
what matters is that the object is const
oh I see
we are passing by reference in Eearslya's example
I get it, it's a copy otherwise
Anyone here use the shaderc lib from the SDK? The CMake documentation says that shaderc_combined is a static library, but I'm getting a runtime mismatch error.
lld-link: error: /failifmismatch: mismatch detected for 'RuntimeLibrary':
>>> libcpmtd.lib(xstrxfrm.obj) has value MTd_StaticDebug
>>> shaderc_combinedd.lib(shaderc.obj) has value MDd_DynamicDebug
clang++: error: linker command failed with exit code 1 (use -v to see invocation)