#Iris - A Journey through OpenGL and beyond to learn Graphics
1 messages · Page 10 of 1
binding as in vkQueueBindSparse?
I think I’m wondering if sparse can be attached to a frame buffer
Like the entire texture region (even if partially committed)
I think so
hol up lemme read the spec
ye there is no mention of sparse resources being incompatible with VkFramebuffer
I have acquired a machine capable of doing sparse
(for legal reasons I have to disclose that I did not, in fact, steal the machine)
Grinding GP while on vacation 🫡
i had the very same thought
Wtf
hook up with some locals
I do have stuff to do tonight as well
I thank god every day that my body seemingly has limitless energy
I don't feel tired for whatever reason 
maybe I am going to die
I am going to die soon yes
I don't feel the physical sensation of being tired, but I notice that I become dumber
only possible explanation
hascheesh al jaffar al jaker... i think we will get nanite 3.0 very soon
running on some fujitsu t450, 2gig of ram, some pentium 4 thing
on a custom version of mesa
hmm jaffar ... reminds me of these
I can at least rest assured that AMD performs well with sparse
on RADV at least
pixel said amdgpu shouldn't be too different as well
so common amd W
then somebody needs to convince novideo to make it right on their chips too
Yeah once ue6 is dependent on lvstri's stuff, they will
call them back
use force if you need to
I have air support if needed
awesome!
I did some experiments with OpenGL sparse virtual shadow maps today actually
holy pog
they seem to mostly work when used as a framebuffer target
yeah used the extension
support seems to be pretty strong for the 1060, so hopefully newer hardware is even better
eggcellent
but it allows me to commit and decommit, and I'm using compute to write to some readback textures which tells the CPU what to do
Mostly?
can I see 🥺 👉 👈
I might see if I can try stencil buffer rejection for the pages
If it wurks
ye, the shrimple fact that it works is hands down incredible
I will try to capture the framebuffer
nsight really really hates 16k textures
at lesat on this gpu
the black parts are uncommitted
as the camera moves it allocs/frees pages
what do you think about stencil rejection for pages?
anyways, do you need to keep track of pages yourself?
I wonder if it would be viable
Did you read the final fantasy shadows paper
yeah this is done with a 128x128 residency texture
well two of them
one for this frame, one to look at prev frame
saw someone linked it but I forgot to read it lol. Did they use vsm?
I have no idea for stencil, Froyok is far more experienced with stencil
hmmm maybe we need to ask Froyok
oh so two page tables, how come
basically it seems that on OpenGL at least
if you read from an unalloced page, it returns all 0
so what if we explot that by setting alloced pages to some value other than 0
and have stencil rejection regions
No but it was pretty cool
Anyways combining vsm with it would be cooler
basically it just checks "if previous frame it was allocated and current frame it is as well, do nothing"
or "if previous frame it was not alloced and this frame it is requested, alloc it"
hm, perhaps my idea of keeping only one page table isn't viable
do you use morton codes btw
maybe we need to check it. @wicked notch did you read it?
to store the pages
what are those 
only halfway through
but they don't do VSM, their shadow map paper is mostly for point and area lights
Funny numbers
they have a pretty interesting idea where they make a frustum per screen tile and bin depth values, then they make a close up shadow frustum per tile, based on the binned depth values
but I dunno how it actually works
how do you access pages then 
- one 16Kx16K sparse texture (hardware supported)
- two 128x128 residency tables (regular textures)
- a readback SSBO that the compute shader writes the changelist to
all I did was attach the texture to a framebuffer lol
you use textures for page tables? 
no everything is on one level for now
yeah for the analysis part I just take the depth, convert to world space, then convert to shadow uvs and index the table that way
I might give the stencil thing a try later
hmm
yeah I think you're right
we'd probably need a page table for each mip level right?
yeah definitely
making yet another branch..
btw how do your VSM look
let me see them sharp shadows
same this has been my experimental shadow branch
not sure if it'll ever be ready to merge in
like evict a page from memory?
yeah
yeah right now I just have it set to
if a page was around last frame but not needed this frame, evict immediately
(not caching anything atm lol)
it seems to be raster
I think it could be a lot better since it wastes time with non resident pages
which is kind of why I'm wondering if some kind of stencil method to reject invalid pages would help
my guess is page rejection will go a long way with performance
but to truly make it viable for older hardware caching would need to be really good
hmm I'm wondering what is the probability that NV's driver is actually doing some deferred eviction kind of thing
Yeah it’s probably pretty high
Hmmm good point though, better eviction strategy is definitely needed
my idea for Vulkan was not evicting, ever
instead I keep track of pages internally and just overwrite stuff where things change
because the idea is that you allocate big chunks of memory and then you suballocate them to sparse pages
That would work
eviction in this case would mean deallocating the big chunk of memory
A few
well I'm thinking to make the pool size 1GiB
or something crazy high like that
it's something to experiment
One really nice thing you could exploit is caching if you go that route now that I think about it
Since you never remove pages
True
also can you trust the GPU to not invalidate memory 
What's the question ? 😄
Oh yeah I think maybe it boils down to how much performance can be gained with stencil rejection. Is it similar to increases we can see with early depth rejection?
If the cost is on the fragment shader side (and not the raster itself), then I think so. 🤔
I guess it's raster only
so maybe some other sort of culling is necessary
mesh shaders would be so good here
you can just use a task shader to cull invalid pages and dispatch only the necessary meshlets
Yeah think this one is raster bound mainly, dang
Oh wow
That’s massive actually

The alternative is work expansion in compute shaders
which should work just as well, it's just painful to implement 
the core idea is the same
check if the page is invalid, check whether any meshlets overlap that page, if both are true cull meshlets for that page
Yeah true that would work
I saw a paper a while back too where they estimate if a shadow will intersect the frustum
the only painful thing to check for is whether said meshlet overlaps many pages
iirc the nanite talk went over that
they also do HZB wtf
HZB except for memory pages 
so much indirection
I lost count
Hmmm yeah that’s also true
Though I do wonder how much better your perf will be since you’re using nanite 2
I’m testing just with regular meshes
probably a bit better
the reason it works so well is due to the finer grained nature of meshlets
it could work just as well with regular meshes to be fair
just perhaps less of an impact in 🅱️erf
: )
if I'll be still lucid by then
ye it's unlikely that I will ponder them 
we need a cooler name for it too btw 🙂
I'll leave that to my aide JStephano
dropped jaker like a hot potato
if Jaker is my right hand man, JStephano is my left hand man
I have two hands for a reason
Tell everyone there about it
They'll be impressed
finally
the thing that will destroy darian
not even I can load this
it crashes while trying to allocate 20GB of VRAM (totally reasonable and valid amount of memory)
yeah you need an AMD GPU to load that
one of them workstation ones with 48 jigs?
why one file btw 🙂 why not split it into chunks?
"somehow"
omg
I'll send it to darian soon
I have a gigabit internet here, luckily the netherlands are a civilized place
seems like the future of game sizes will be bleak.. 16gb of just geometry seems...
fuck I don't even have enough space on my drive
🙋♀️
wait until jakers vkspec llm turns hostile and generates mesh out of text
I envy my friend so much right now, his workstation exported 30GB of FBX and converted it into 20GB of glTF, all in 2 hours
some threadrepperino?
is that its operating temperature
no that's the idle one
operating temperature is somewhere between the surface of the sun and the core of super star WR102f
wolf rayet stars are cool shit

ahh yes, don't you love it when a single layered/mipmapped texture takes up 32GiB of VRAM
Oh wow wtf. Just from one 16k mipmapped texture?
I also layered it 32 times over 

we found the random blogs that has fake sparse http://gpupro.blogspot.com/
(MJP did, god bless him, literal MVP)
all this talk about vsm makes me want to make my own vsm thingy 😄
berndt das brot
I don't wanna do it if you want to implement it in FF though
Since I have plenty of other things to look at
I'm not sure
on one side GL's sparse is much easier
on another side I know it's gonna perform very bad so fake sparse will be necessary 
The impl I was thinking of wouldn't use real sparse anyway
I'm looking at SVT's slides right now
and I guess I get it
but the UV remapping to the real texture is weird
the hell does this mean 
paging is offset galore truly
tl;dr indirection
I feel like reading code will just increase confusion at this point
The whole thing just seems like indirection and basic arithmetic
ok yes it's probably less clamplicated than I expected
I still fail to see how to translate virtual uv to physical uv though
ok suppose I wanna make a physical texture that will have at most 4096 128x128 pages
so basically a 8192^2 R32_SFLOAT texture
then I take a regular 2k texture and stuff its pages in the physical texture
now suppose I want to get the texel at uv (1, 1)
This will translate to page indexed (127, 127) I think?
which will again translate to somewhere in the physical texture, suppose the page also gives us the uv offset into the physical texture
I guess they're using one big texture2D for the backing memory
why
how do I decide how big to make the backing storage
isn't it better to have many smol ones
ah yes
If you have a strict budget that you can decide at init time, then surely you can get by with one big boy texture
anywho
we got the correct offset into the physical page that has the sample we're looking for
horray!
except it's just an offset, specifically it's the lower left corner of the page, we need to offset the offset by something
um here's what I was imagining
vec2 virtual_uv = WorldPosToShadowUv();
ivec2 page_texel = ivec2(pagesDimensions * virtual_uv);
vec2 physical_uv = fract(pagesDimensions * virtual_uv);
int physical_index = texelFetch(s_virtualToPhysical, page_texel).x;
float shadow_depth = textureLod(s_shadowArray, vec3(physical_uv, physical_index), 0.0).x;
I made the backing texture an array of pages to spice things up
are 0, 1 and 2 the backing textures
these are pages
we don't know or care where they (physically) are, nor their size
the virtual2physical texture will tell us where they are physically located
and I guess what you do with size depends on how you handle the backing memory
maybe
using an array for the backing memory seems to most logical here, since it doesn't give any false impression that virtual pages map directly to physical ones
this picture is gonna be the hottest drop of 2023
here it comes
alright so lemme explain this shit
the first texture, marked with zero is the shadow map
let's say we are good boys and it's only 2k
we wanna sample in a particular texel
alright no prob, we go onto texture 1, the page table
the page table tells us the uv of the lower left corner of the physical page in the backing texture
very nice, let's go onto texture number 2, the humongous backing texture
now here's a problem, we know the page we want to sample from, but the actual texel info is lost, we cannot use the original offset because the texture sizes mismatch
and we don't even have an offset to begin with, since we only know the virtual uvs
how do we deal with translation in this case
ok I realized an error in my code
virtualShadowMapSize * virtual_uv
should be
pagesDimensions * virtual_uv
you need to get the UV into the page in the first step
ok so we get the uv of the lower left corner of the page we are sampling in the virtual texture
then we subtract = offset into the page
right?
ok so I guess you can ignore the actual size of the virtual texture here (when sampling)
you just need to know the UV and how many pages there are on each axis
I'll draw another pic
lol I didn't want to calculate it since it would be a yucky number
but we know where the physical page is and the offset into it, so it would not be hard
it should be lower left corner + (0.25, 0.25) * ????
1/3 is your last number innit
how did we solve elementary skool 
listen
I mean pass 
I like the random red herring of specifying that the whole VSM is 256x256 and then never mentioning page sizes again
This is awesome lol
ok jaker
anyways yeah it would be simpler to use an array texture because you wouldn't have to do this incredibly difficult + and *
Full blown virtual texturing next I guess
now solve linear mip transitions
ikr, literally the hardest operations
once you know how to take one sample, taking more isn't difficult
is there a spec on how shrimplers work
you can use hw filtering for intra-page filtering, then handle edge cases yourself as mjp mentioned
like linear filtering in Vk or GL
yeah, the api specs
Only aniso
mipmap filtering (trilinear) is actually quite shrimple
you just sample each mip, maybe filter the per-mip samples, then blend them together with fract(lod)
good ol lerp
So I guess with this one the 2048^2 represents our “true” shadow map which covers the entire scene
alrighr @wispy spear pin this utterly deranged rambling when possible
ye
it should've been 256^2 because then I would've actually been able to understand what was going on 
the only thing left to do is the allocator I guess
but that's a problem for tomorrow
tomorrow's me will surely be far more intelligent and not dumb at all, for sure
thank god the allocator is the same both for fake and real sparse
Real fake allocators
how do I name this allocator
sparse page allocator?
real fake sparse page allocatorn
William
william it is
Damn, I wake up to another, 2 hours long, uber detailed explanation
hey thats devsh
so, what heuristics are used to determine if a page is resident? do we just analyze the gbuffer each frame?
I guess that is easy enough
just depth
ye
big brain: don't mark request it resident if albedo is 0 since the shadow won't be visible anyways
step 1. unproject_depth()
step 2. to_page_index()
step 3. make_resident(page_index)
ah yes
so the only hard part about this is the CPU readback crap
unless
allocating/freeing pages via a compute shader (only viable with fake sparse of course)
how the hecc are you supposed to allocate pages with compute
you need to call in API functions no?
hm then yes
not having to do cpu readback is compelling ngl
but it's necessary once you run out of backing memory
otherwise it isn't very virtual at all 
the virtual part is remapping pages
and possibly pretending you have more memory than in reality
you can make this work with clever use of clipmaps and fixed storage I'm sure
perchance
for instance, a worst-case where you can see the whole scene can just allocate pages for a coarser map instead of filling the entire higher-detail map
I guess determining which level of the clipmap to use is when the heuristics get hard
maybe not
you can just prioritize coarser clipmap levels when you need to evict
assuming they overlap like this
the thing is, there's always a budget no matter what you're doing
with real sparse, you don't need to max out your budget immediately
well I guess with non-sparse you can just create separate textures that act as your memory allocations, so idk
but yeah I think figuring out how to handle eviction in the worst cases will be the hardest part of this
could you use the other 10bits for that perhaps?
the way I was thinking about memory is the same for fake and real sparse
with real sparse I would've allocated 256MiB VkDeviceMemory every time
with fake sparse I allocate 256MiB physical textures
ah the 10 bits thing was related to how the hardware handled rasterization internally. I wasn't clear about that
but ye eviction will be very bad to handle
maybe if block of pages hasn't been touched in X frames evict
depending also maybe on the remaining budgets, how many pages are currently active, etc
I think you should only evict if needed
that way you can avoid writes if nothing changed
ye, if the remaining budget is: "a lot", then no eviction
I reckon there needs to be an atomic queue (or multiple) for pages
hm?
when you allocate a page, push to the queue. when you need to dealloc, pop the oldest thing
oh, nah I just use morton codes
idk I remember they did something similar for surfel allocation in surfel GI 
I make an array of 128x128 unsigned integers, last bit is the valid bit, each frame I clear everything to invalid and then I readback the valid pages and mark them
alright so that gives us the invisible pages I suppose
but that also means no reuse unless you use another bit
what does this bit do
it tells us that something has been rendered to the page at some point
if the page becomes visible again, we needn't re-render the page unless the sun moved or that area of the scene changed
ah
which is admittedly a bit of an endgame optimization
JS already has shadow map caching, hurry up 
I'm still conflicted on whether I should first finish real sparse or already move on with fake sparse
if you do real sparse, I'll do fake sparse
then we can compare
I'll have to go to bed now because I need to wake up and perform manual labor is 5-6 hours 
💀
This is what I ended up doing
Every frame ended up being excessive
The read back is an ssbo that compute atomically pushes work to for the CPU
Though it still might be necessary to add a sort of allocator hardware budget to prevent it from trying to dealloc too many pages during one frame
hmm another problem has arose
Not every virtual texture will have the same resolution, some may be 16K, others may be 4K, so their page table sizes too will differ
Eh actually this is easily solvable, I just have to make yet another buffer storing each texture's page table offset in the main array
why are you asking this here 
OH FUCK
I dun have a stretchy monitor
no prob lmao

void main() {
const ivec2 position = ivec2(gl_GlobalInvocationID.xy);
const uvec2 size = imageSize(u_visbuffer);
if (any(greaterThanEqual(position, size))) {
return;
}
const uint depth_bits = uint((payload >> 34) & 0x3fffffff);
if (depth_bits == 0) {
return;
}
const float depth = uintBitsToFloat(depth_bits);
const vec2 uv = (vec2(position) + vec2(0.5)) / vec2(size);
vec4 world_position = inverse(u_camera.data.pv) * vec4(uv * 2.0 - 1.0, depth, 1.0);
world_position /= world_position.w;
const vec4 shadow_uv = shadow_data_ptr.pv * world_position;
const uint shadow_tile_x = min(uint(shadow_uv.x * VSM_RESOLUTION / VSM_TILE_SIZE_X), VSM_TILE_SIZE_X - 1);
const uint shadow_tile_y = min(uint(shadow_uv.y * VSM_RESOLUTION / VSM_TILE_SIZE_Y), VSM_TILE_SIZE_Y - 1);
const uint shadow_tile_index = shadow_tile_x + shadow_tile_y * VSM_PAGE_COUNT;
atomicAdd(vsm_page_req_ptr.pages[shadow_tile_index], 1);
}
``` amazing
Is this real sparse
well this is just the page request part of real sparse
the thing where you see which pages are needed
That's just analyzing the gbuffer eh
Depth
Wait so you aren't even touching the sparse texture here 
Ah
love to see pixelwarps in your hw raster
You should probably transfer to host after this pass or something
🥸
ye I'll do some transfer shenanigans
Darian said it's even more painful than HOST_CACHED 
alright
surely much better
but will I read correct data?
or rather
will reading from the CPU be absurdly slow?
I shall now test
welp
it takes half a millisecond to read 16KiB of data
jesus
hmm
I think a dedicated transfer operation is necessary
I don't need to care about memory types though do I?
Like I always transfer from DEVICE_LOCAL to HOST_CACHED
yup, dedicated transfer op is a noop basically lol
and reading takes 2 microseconds
amazing
I mean just the basic ones
Hmm
once a virtual page is mapped to a physical page, should this mapping remain the same until the virtual page is evicted?
or rather, overwritten, not evicted
I don't see why you'd need to shuffle around allocated page tbh
maybe not for shadows
but for textures in general
actually, maybe for shadows too
I am suballocating from a 256MiB VkDeviceMemory after all, I need to free up space whenever possible
Say in frame 0 we draw to virtual page 0, this will map to physical page 7 or something; now suppose we never draw to virtual page 0 again, the physical page 7 will never be used again
If I don't "free" it
here, have mine.
ok the roadmap is as follows
step 1. get id of the virtual pages requested from the GPU
step 2. invalidate all pages
step 3. for each requested id, if it's not resident allocate new page
I am skeptical about the invalidate all pages step, but it's a logical operation so it's fine I guess
Jaker can you fact check
JS you're welcome to share the neuron and fact check too
Invalidate all pages seems okay
I think it will even work when you add caching, provided you add a little more bookkeeping
I'm thinking of it like a linear allocator
that gets reset every frame
except already resident pages that didn't change shouldn't be touched
so actually hold on
I'm holding for dear life
ok I got it
I keep track of 2 things
allocated pages and non allocated pages, for the allocated pages I also keep track if they are resident or not
when I get the requests from the GPU, there are a lot of things I have to do
while I was writing I decided that this is stupid
Better method: only keep track of allocated and unallocated pages
Now:
- requested page is already allocated => do nothing
- requested page is not allocated => allocate and update sparse bindings
- if page wasn't requested => deallocate it
how does this sound
Most of the time that will be really good
One exception I ran into was big camera changes if you immediately dealloc
the only time I call in vkQueueBindSparse is if 2 happens
Like frame 3 might get overwhelmed with dealloc requests
Hm
Then frame 4 the dealloced pages get requested again
btw deallocation in my case effectively does nothing
it just changes one bit in the page table
ah ok I don't think that will be a problem then
so memory still stays around?
ye memory and sparse bindings are untouched
Ok I think that will be fine since it's a cheap operation
I make a promise to not use that page again
if I break the promise I die
actually
I can break the promise
until I do caching
if I do caching and break the promise then I really will die
no worries, ill resurrect you
I wonder if SDSM would even be a reasonable thing to implement with VSM
caching would instantly die
eh, it would probably suck since the whole point of SDSM is to tightly fit the shadow map to the frustum, which would play poorly with the concept of sparse allocation
speaking of shadows
did anyone of you guys tried Tetrahedron Shadow Mapping yet
i want to learn it because it seems a really cool to have omni shadows on a 1x1 grid
I have not seen it
Introduction:In this paper, I will describe the process of my implementation of Tile Based Omnidirectional Shadow Mapping, which is an effective way o
only free resource about it is some student project
i have a huge problem right now regarding omni shadows because the filter kernel leaks into the other parts of the cubemap
yeah 😦
there is a source code
ok the algorithm seems relatively shrimple
I rarely see user-assigned clip planes actually used for anything
the draw call generation is interesting
ye caching is dead with SDSM
but we more than make up for it with clipmaps
@wicked notch how come you only came out of hibernation half a year ago or so and not already when you joined this frog pond 2 years ago? : )
I joined this pond 2 years ago due to my uni project I think I mentioned
oh, i may or may not have missed that
it was a Vk project and most of the questions I asked here at the time were things my classmates had written 
i dont rember seeing you being active before hence the question 😄
ye I basically wasn't active
either way, im glad you are here
I think the discord server panel even has a buzzword for this
it's called activation time or something
#define VSM_TILE_SIZE_X 128
#define VSM_TILE_SIZE_Y 128
#define VSM_CHANNEL_SIZE 4
updates.clear();
for each page_index in requested_pages {
if (page_index == 1) {
if (page_table[page_index].is_allocated()) {
continue;
}
auto memory = VkDeviceMemory();
auto offset = VkDeviceSize();
page_table[page_index] = allocate_page(page_index, &memory, &offset);
updates.emplace_back(VkSparseMemoryBind {
.resourceOffset = VSM_TILE_SIZE_X * VSM_TILE_SIZE_Y * page_index,
.size = VSM_TILE_SIZE_X * VSM_TILE_SIZE_Y * VSM_CHANNEL_SIZE,
.memory = memory,
.memoryOffset = offset
});
} else {
page_table[page_index] = deallocate_page(page_index);
}
}
if (!updates.empty()) {
vkQueueBindSparse(...);
}
ok good
this will quickly get out of hand for textures though
actually nevermind
this is completely bogus for regular textures and will work only for VSM
Ironed out it kinda looks like this
struct sparse_memory_block_t {
VmaAllocation memory = {};
VmaAllocationInfo info = {};
uint64 allocations = 0;
};
struct sparse_memory_page_t {
std::reference_wrapper<sparse_memory_block_t> block;
uint64 offset = 0;
uint64 size = 0;
};
struct sparse_image_memory_bind_t {
image_subresource_t subresource = {};
offset_3d_t offset = {};
extent_3d_t extent = {};
sparse_memory_page_t page;
};```
I am sort of unhappy still
as it turns out
an unconditional vkAllocateMemory inside the render loop will lead to bad things
This is the single, most fucked up, most inefficient and most ridiculously non-scalable piece of code I have ever written, in my entire life
but it somehow works
actually I take "inefficient" back
it's actually pretty good 
averaging 30usec
but it is still stupidly non scalable
I gotta separate the "page allocator" and the "block allocator"
very tedious overall 
things left to do:
- build a mip chain out of the page table, to be used as a "HZB"
- build meshlet lists for the hw and sw rasterizers and cull pages
- actually draw the shadow map based on the info above
oh btw, @raven orchid how bad is clearing a 16k render target? 
Actually pretty good!
Well though I’ve never had the full thing in memory though
But clearing the sparse target every frame hasn’t been bad so far surprisingly
Block is collection of pages or is it something else?
I’m actually wondering about your approach to drawing the shadow map since you have combined hw and sw rasterizing. This step has been one of the worst for me
I draw into an R32_UINT shadow map
and I imageAtomicMin it with floatBitsToUint(depth) for both the hardware and software paths
so rip early Z
but I already know it won't be an issue
I'm confused didn't you do rt a little while ago? How are you doing insano shadow stuff now 🥸
I'm interested where this shadow map hole leads you to, I need inspiration, I hit a bit of a slump when it comes to my shadows
We've been discussing it in here, #1090536732769927178, and #1128020727380054046
Will make sure to check it out, thanks
Okay I have no idea what VSMs are nor what you talk about in any of these threads but I'm sold already
Give me two days to catch up lol
basically we are decoupling storage from our shadow map, allowing us to act as though the shadow map is huge (16k^2), at the expense of having to determine active pages and manage memory ourselves
So you are pretending you have huge shadowmap but only actually draw the sampled parts into a much smaller texture which you manage manually?
Nvm I'm starting to get it
Taking the risk of sounding really stupid, could you not make your VSM span the frustum (similarly to SDSM) and then virtually reproject the tiles from last frame that are still in the new frustum this frame + draw the new tiles you need where the frustum moved?
It could work if your frustum snaps to the world grid or something
That way tiles can reproject cleanly
I reckon you'd also have to transform light space depth from the old frame into the current
Why would you need that? The light space depth should be invariant to the frustum position no?
The thing that worries me the most with having small "tiles" is the amount of times I'll need to draw the scene
I'll prob need to yeet someones culling in order to make this viable
no, you draw the entire thing all at once
but you do culling per tile
i.e for each mesh/meshlet you determine if it overlaps any tiles that will be needed for sampling
if it do the you draw it
writes into unpaged memory get discarded with real sparse, but with fake sparse you should make sure to not write into unpaged tiles
I dunno what kinda substances unreal devs sniff to come up with this shit but boy am I glad they do

Yeah the soda there must have a lot of minerals or smth
I am considering using a linear allocator for this shit
I now realize that 250 microseconds on average is far too much on the CPU side to update just 16k pages
I'm also considering placing a fixed amount of pages that can be updated each frame
but that's if the linear allocator thingy fails
linear allocators are good
its really impressive how fast atomics are on gpus
its so good
Alright this is hard
Same
Here's how I'm thinking of solving the amounts of page requests per frame
That's what he said
layout (scalar, buffer_reference) restrict buffer b_page_table {
uint8_t[] pages; // this is num_pages_total, for all virtual textures that exist
};
layout (scalar, buffer_reference) restrict buffer b_page_req_table {
uint count;
uint[] pages;
};
void main() {
// do whatever it is I have to do to get a page index for this frame or PAGE_INVALID if this pixel requests no pages.
const uint page_index = find_virtual_page(...);
if (page_index != PAGE_INVALID) {
const uint8_t page_value = atomicExchange(page_table_ptr.pages[page_index], 1, gl_ScopeQueueFamily, gl_StorageSemanticsBuffer, gl_SemanticsAcquireRelease);
if (page_value == 0) {
const uint slot = atomicAdd(page_req_table_ptr.count, 1, gl_ScopeQueueFamily, gl_StorageSemanticsBuffer, gl_SemanticsAcquireRelease);
page_req_table_ptr.pages[slot] = page_index;
}
}
}```
Ok pardon for the long time, I was thinking about it while I was writing 
I call in the atomics man: Wpotrick
please analyze this code 
summoning @glass sphinx
I should also note that this is to make CPU readback easier by sending in only page indices that have actually been requested, instead of all the pages
given that the memory is deallocated automatically every frame, since I'm switching to linear allocators
it detects which tile of which texture this pixel is going to sample from
alright monkey man, have a nice workout session
Hmm I'm still thinking
Now that I have all requested pages, there is no way to guarantee order
so each frame I would end up doing a huge number of page requests
or wait, not exactly
the page requests would remain the same
But their location in memory could differ
Do I want this?
that might mean some of them are colder in cache than otherwise could be
but I dunno if your actual access patterns make that noticeable
Alright I was terribly wrong
I do need to update page bindings too with a linear allocator
Each page will already be a multiple of the cache line size so it doesn't matter too much where they are
After very careful thought
A linear allocator is not usable for this kind of thing
I need some allocations to be persistent
My brain is currently at max capacity, overheating and I still don't have a solution to this problem, despite thinking most of the day
The thing is also conceptually very easy too:
- Get page indices requested from the GPU
- For each page in the request, check if the requested page is resident, if it is do nothing, if it isn't, allocate and update sparse table, also if the page isn't requested deallocate it
And yet, this performs terribly, even with a first fit allocator
250us on average to update a mere 500 pages
How do I fix this
I don't think a linear allocator can fundamentally work either, I need some pages to stay resident in between frames
I thought also about maybe a "bump allocator" with a ring buffer, but that still doesn't sound right
kinda fallen behind this problem, but why can't you split your allocation strategy between persistent and non-persistent stuff
I'm not sure how to do that split
any page can be resident for any number of frames
"Persistent" here means, "the gpu requested this page 2 or more times in a row"
the first frame page 0 is requested it is allocated, if the next frame page 0 is requested again, nothing is done (no deallocation or moving of sorts)
english died for a sec there 
oh then rip
Does a bump allocator with a ring buffer make any sense whatsoever
maybe ¯_(ツ)_/¯
why not just have some central bitmask block you can check to see if a page is taken
I think some real OS allocators use that
It doesn't really matter, but 64KiB
idk it probably won't be that big though
It would be 16384 bits big
Because in one virtual shadow map there are 2^14 pages
that's not that bad
Hm perhaps
So the bump allocator I think would work like this
auto curr_ptr = bump.current();
while (is_page_allocated(curr_ptr)) {
curr_ptr = bump.advance();
}
allocate_page(curr_ptr, ...);```
Doesn't look too bad
Advance automatically brings the ptr back to the first slot once it has reached the end
And is_page_allocated is said bitmask czech
you could probably use fancy bit intrinsics to do funny stuff there too
both on the CPU and GPU there are dedicated instructions for stuff like bit count, or getting the first bit that's a 1 or the last bit that's a 1
Ah, __builtin_clz
so you can probably check 32 pages at a time if you're smart
or 64, not sure what you're targeting
your CPU system or GPU system
CPU
Maybe I didn't mention this, but this allocator should run on the CPU ye
I got a big brained idea
__builtin_clz is exactly what I need holy shit
const auto free = __builtin_clz(~bitmask);```
or not
eh, nah rip
it's not what I need
there must be a fancy instruction that returns the first 0 bit
here's a higher level simulation
from what I read just now, beware that ctz/clz are technically undefined when x = 0
so you might need to be careful with it
Ok clz is definitely what I want
Suppose 0 is allocated and 1 is free, __builtin_clz(0011) returns 2, __builtin_clz(1000) returns 0 and __builtin_clz(0000) is undefined (good, it means it's completely full)
epic
undefined means it'll return whatever and you can't depend on it
so you have to check it separately
yeah, I'll just add a separate checc
so then
So the while loop before becomes bogus if I can check 64 things at a time
for each mask in list {
if mask == 0 { continue; }
const auto free = __builtin_clz(mask);
allocate_page(free);
break;
}```
I wonder if there's SIMD versions of clz
ah yes
loops? what are those
I only know 8192 bit wide vector instructions
int mm256_lzcnt_si256(__m256i vec)
{
__m256i nonzero_elem = _mm256_cmpeq_epi8(vec, _mm256_setzero_si256());
unsigned mask = ~_mm256_movemask_epi8(nonzero_elem);
if (mask == 0)
return 256; // if this is rare, branching is probably good.
alignas(32) // gcc chooses to align elems anyway, with its clunky code
uint8_t elems[32];
_mm256_storeu_si256((__m256i*)elems, vec);
// unsigned lz_msk = _lzcnt_u32(mask);
// unsigned idx = 31 - lz_msk; // can use bsr to get the 31-x, because mask is known to be non-zero.
// This takes the 31-x latency off the critical path, in parallel with final lzcnt
unsigned idx = bsr_nonzero(mask);
unsigned lz_msk = 31 - idx;
unsigned highest_nonzero_byte = elems[idx];
return lz_msk * 8 + _lzcnt_u32(highest_nonzero_byte) - 24;
// lzcnt(byte)-24, because we don't want to count the leading 24 bits of padding.
}``` oh god what the fuck is this
looks weird
mfw gcc can't auto vectorize this shit because of control flow
alright that's enough bikeshedding the most optimal of clz instructions 
I shall remember this for tomorrow™️
is_allocated(allocator, page) {
index = page.index / 64;
bit = page.index % 64;
mask = allocator.list[index];
return !(mask & (1 << bit));
}
deallocate(allocator, page) {
index = page.index / 64;
bit = page.index % 64;
mask = allocator.list[index];
mask |= 1 << bit;
}
allocate(allocator, page) {
for (mask in allocator.list) {
if (mask == 0) { continue; }
index = 63 - __builtin_clz(mask);
mask &= ~(1 << index);
page.memory = allocator.memory;
page.offset = ...;
return VkSparseMemoryInfo(...);
}
}
for (req in requests) {
if (req == REQUEST_PAGE_NEEDED) {
if (!is_allocated(allocator, pages[req])) {
updates.emplace_back(allocate(allocator, pages[req]));
}
} else {
if (is_allocated(allocator, pages[req])) {
deallocate(allocator, pages[req]);
}
}
}```
This looks very promising
bit ops, how lovely
that looks good to me
nice, unfortunately I won't be using that 😦
Or well
all is to be seen
so good thing you confirmed it was good
@raven orchid you may wanna invest in bitops too
ok so, my allocator's performance was not very good
I was originally using a first fit, free list allocator (for some goddamn reason, no wonder it was slow) to manage my virtual pages
turns out you can do a much better thing
you use a big ass bitmask to remember which pages are allocated and which are not, then when you allocate you simply go through the bitmask and find the first free slot, using a CPU intrinsic
deallocating is even easier, you take the index of the page within the bitmask and the index of the bit, and you set the bit back to 1
now this, is blazing fast
like nanoseconds blazing fast
Ohhh now I see so this is specifically optimizing the problem of
How to find the next free slot fast
yes exactly
make it branchless
I dunno how to make that branchless 
make what specifically branchless
you can make all the aggregated bit checks branchless probably
this ig
hmmm
you just need to do
actually nah
idk
I was thinking you could use a special sentinel value like (mask == 0) * sentinel + (mask != 0) * __builtin_clz(mask)
but you'd just be deferring the branch
https://en.wikipedia.org/wiki/SSE4
uhhh
there seems to be something for LZCNT and BSR
No I fell into the caching rabbit hole
But I did increase the page group size to 32x32
wot is this
nvm I think the scalar lzcnt came in SSE4, not that there's an SSE lzcnt
that's a rip
Current is still that a page group just maintains a min bounding box around invalid non cached pages and checks against that only
SSE4 is one of the later x86 isa extensions
rip
For the physical memory itself I lost track
Did you decide on a “never actually free full” strat or do you release memory sometimes?
if you had a 32 core avx512 CPU you could do an allocation check in 1 cycle
the current idea is to deallocate a page block when it's empty for a number of frames
if only...
weird way to say: "buy a fucking threadripper scrub"
simply buy one of those
Another question
Could your current alloc strategy be partially moved to the GPU? Like could compute essentially select the next free pages atomically and just write that as work for the cpu to
if you can do it with this simd crap you could totally dispatch a compute shader to do this same operation
I guess I’m thinking right now it’s probably
Gpu: I would like a page
Cpu: I will find next and alloc
Vs
Gpu: I need you to alloc this exact page
Cpu: ok
it'd be cool too
yes this is totally doable on the GPU now
the only thing I have to readback is the indices of the virtual pages I should update and the offsets into the device memory it should reside in
the only problem I see with this is when the block is completely full
But I could issue an extra dispatch for that no prob
Unless the gpu could somehow just do some kind of virtual block strategy
And the cpu would alloc the block followed by the actual requests lol
man the world would be a better place if the GPU could allocate memory for itself
it'd be cool if the CPU had some kind of stable handle to GPU stuff so you didn't have to readback
Yeah 😦
vkCmdAllocateIndirect
how long does UE5's impl take?
UE5 does it differently
But their code is so convoluted that I'm not basing it off UE at all
makes sense
clz is part of the standard now btw.
what's the standard one look like?
ye I use that now
std::countl_zeros
page table HZB is such a ridiculous idea
Impossible, it can't be done
watch me do it anyways and fail miserably 
i believe lvstri
I don't
at least sync works now
timeline semaphores are fucking awesome
man this is so incredibly tedious
as it turns out when your entire renderer is hybrid sw/hw compute/mesh shader it isn't exactly trivial adding stuff 
literally me rn
ikr 😄
perfectly depicts my current struggle 
yeah I'll leave this for tomorrow me
surely tomorrow me will be far more intelligent
sounds like my motto
Don't worry, I'll finally be at my computer tonight and my implementation will be perfect
Inshallah
I am extremely unhappy with how I wrote my shaders, I kept thinking "eh this is fine for now"
Well, it isn't fine anymore 
and I have also determined that HLSL still isn't as good as a drop in replacement as I'd like it to be
so good old GLSL it is
horray
Actually scratch this
I'm not convinced
GLSL sucks
sucks so bad
How much can I hide behind convenience functions the inline spirv?
what dont you like about it
functions
there are a million restrictions on functions, plus there are no templates which would come in handy for some stuff in my compute rasterizer
it's hard to generalize something when functions suck
since i use bda and texture handles i dont really mind it mich
I do too, but sometimes just using functions is nicer
take a hypothetical check_hzb function
struct X {
// can't have this in a struct in GLSL
sampler2D hzb;
// other data
};
struct Y {
image2D hzb;
// other data
};
// here a template would be useful for the core part of the algorithm, taking in both X and Y
template <typename T>
vec4 get_projected_aabb_uvs(in T x) {
...
}
bool is_occluded(in X info) {
// regular HZB sample check
...
}
bool is_occluded(in Y info) {
// different things, page table HZB check for example
...
}```
well i have all textures virtual with handles
all this would be possible in HLSL as far as I'm reading
so this is not a problem for me
the makros are pretty simple now
i generalized them back
so they arent typed anymore
just the accessor makros are
in shaders i mostly dont feel the need for hlsl too much yet
i started to dislike bda
because i dont have boundschecks
it just explodes
super annoying
so i wanna try hlsl byteaddressbuffer
ehh it's not too bad
oh yeah that's nice
I kinda want to YOLO it and move to HLSL no questions asked
for example for rt i will use hlsl
cause nsight doesnt understand bda
so i couldnt debug it
ye but DX feels shite to use
plus windows.h
I dunno about DX, maybe I'll try it sometime in the future
but it's yucky
oh for sure
I'll probably never switch to D3D12 even if it turns out to be the greatest
Vk works fine for me
one of the few nice things in hlsl is that you can have multiple entry points in a file without makro spamm
sometimes its also quite painful without templates
also having member functions is also nicer to txpe
the syntax is much better in hlsl
TODO: Read this carefully
