#Iris - A Journey through OpenGL and beyond to learn Graphics
assert your dominance by shilling your lib to him
I think I just had a huge brain moment
Actually nevermind
It's view dependent so it won't work
goddamnit
pure sadness
I thought marking the biggest primitive in the cluster and checking only that would be smart, but it isn't

since meshlets are disconnected, I can't guarantee that such primitive will be in view when others are
What does this meeean
How can I do this in a way that doesn't take all my goddamn compute power
you'd think just the AABB extents would give you a close enough read on that
yes but they won't, because meshoptimizer prioritizes topological efficiency over locality efficiency
Which means clusters are very much welcome to have disconnected triangles that span the entire mesh
as long as the AABB is smaller than 32 pixels in any direction in screen space
this is good for low density meshes kinda
I guess that's the most I can do right now yes
oh I get what you mean, but I have a feeling that meshopt's clusterizer doesn't work like nanite's
and I think zeux's opinion on locality only holds if you're not making a hybrid hw-sw rasterizer
Yes I get that, this all translates in "I have to make my clusterizer that matches my requirements"
I have no idea how to though 
you could look at nanite's, or start simple
maybe try converting your meshes to triangle strips
and then subdivide those strips into equal pieces
that should give you some kind of locality
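A minimal sketch of the strip idea above, assuming the mesh has already been converted to a single triangle strip of vertex indices. The name `split_strip` and the `max_triangles` parameter are made up for illustration; this is not how meshoptimizer or Nanite build clusters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits one triangle strip into sub-strips of at most `max_triangles` triangles.
// A strip with n indices encodes n - 2 triangles, so neighbouring pieces share two indices.
// Keep `max_triangles` even: strips flip winding on every odd triangle, and an even
// chunk size makes every piece start on an even triangle.
std::vector<std::vector<uint32_t>> split_strip(const std::vector<uint32_t>& strip,
                                               std::size_t max_triangles) {
    assert(max_triangles > 0);
    std::vector<std::vector<uint32_t>> clusters;
    if (strip.size() < 3) {
        return clusters;
    }
    const std::size_t triangle_count = strip.size() - 2;
    for (std::size_t first = 0; first < triangle_count; first += max_triangles) {
        const std::size_t count = std::min(max_triangles, triangle_count - first);
        // Triangle i uses indices [i, i + 2], so `count` triangles need count + 2 indices.
        const auto begin = strip.begin() + static_cast<std::ptrdiff_t>(first);
        clusters.emplace_back(begin, begin + static_cast<std::ptrdiff_t>(count + 2));
    }
    return clusters;
}
```

Because consecutive strip triangles share an edge, each piece is a connected run of triangles, which is where the locality would come from.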
I'll try to find Unreal's clusterizer, their folder structure is kinda insane 
Wouldn't something like octree give you locality?
Possibly, I'd have to experiment
These are all the software rasterized clusters
For some reason I don't understand there are clusters that are way too big
Ah I wasn't abs'ing properly
Now it's good
Not gonna lie it's kinda cool to see it work in practice, even if 50% of the potentially software rasterizable clusters are hardware rasterized...
Looks pretty good
(still no culling btw)
With everything enabled it's absurdly fast
You can kinda see when the rasterizer chokes because of small tris, for 100 microseconds or so vertex occupancy is great but pixel occupancy is very low, then for the next 200 microseconds everything is good
Frognite
Is this iris thing somewhere in github?
check pins
da peeens
so
here's Sponza
very cute huh?
except uh
Perhaps I subdivided it a bit too much

you got that signature nanite look though
GI is terrifying
you're right
says the person who just reshrimplemented Nanite
that's what my dad's gastroenterologist said
I implemented only the easy part 
The hard part is the LOD DAG
yes
he has a little video series on his discord explaining it
btw I'm curious to check if I effectively wasted my time, or if this is actually better
so I'm packaging this high poly sponza™️ for you guys to test
oh wtf so he actually implemented his own meshleting LOD
I'm hyped to see if it works on my 760
hyped to see if I even have the video memory for it
yeah
my 2x780ti were able to run devsh's LOD thing with 25mio tringles iirc, 1.5years ago
only if it runs on lunix too 😛
Hmm, can I see what's taking so much time
In NSight
Can nsight generate this for older gpus?
I sometimes get “not supported on this hardware” messages with parts of the profiler
Actually yeah, I forgor that GPU trace only works with new GPUs
You should still have the frame profiler though
yeah frame profiler should work
dang wish the full profiler worked 😦
for this mesh did you subdivide it in your thing and then export or how did you gen?
Regular catmull-clark subdivision
I had the gpu trace option available in older versions of nsight, but I never tried it with opengl
I downloaded some library
I think demongod said it doesn't work on his pascal gpu though
yeah with opengl it gets mad for some reason
also make sure you run nsight as admin or sudo
well it gets mad for 2 reasons: A) it does not like pascal
that's another classic
B) does not support anything but DX12 or VK
gpu trace is good enough for me 90% of the time
e.g., if it says you're vram bottlenecked, it's generally quite obvious what lines would impact that
yeah true, usually you can look through the code for that stage and guess the bad parts
register usage seems kinda high, but I'm not sure how to interpret the graph exactly
oh this is my own thing lol

what mesh are you drawing
smh doesn't even recognize his own brainchild
The ultrahighpoly sponza
25 million
Ah yeah, it's very little
All I can say is that the RSM pass is the only thing I attempted to optimize
The rest is just a very naive deferred renderer
25 draw calls btw
Lol the sample RSM pass is super L2 bound because it's doing random samples
I tried to open it in blender and i don’t have enough ram for that lol
hehe, yeah you need too much RAM to do stuff in blender
what resolution btw
do I have your permission to implement my thing in frogfooding
does that require mesh shading?
no
good old vertex shader
I am planning to switch back to vertex shaders as well
I like being able to run on more than 0.01% of systems 
just meshopt
yeah you can add it if you want
though
it's in very early dev atm
I didn't originally plan frogfood as a collaborative thingy, but this feature seems lit enough that I'm willing to try
Epique
does your thingy require 64 bit atomics or neh
If I were able to write to D32 attachments from compute life would be so much easier (and no 64 bit atomics either)
you can write to a D32 attachment and then copy
oops I mean R32
it's ugly, but works
How do you merge 2 D32 attachments?
wdym
Uhhh
It could work?
I mean I need to store meshlet_id and primitive_id
so how would I update the meshlet_id and primitive_id only if the depth test passed?
I could yolo it like this:
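```glsl
const float depth = uintBitsToFloat(imageLoad(depth_image, position).x);
if (depth > current_depth) {
    // YOLO: load, compare and store are separate steps, so another invocation can race us here
    imageStore(depth_image, position, uvec4(floatBitsToUint(current_depth)));
    imageStore(visbuffer, position, uvec4(meshlet_id << 24 | primitive_id));
}
```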

ship it

what we need is EXT_compute_shader_interlock 
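For contrast with the racy load/compare/store above, here is a CPU-side sketch of the 64-bit atomic approach: depth goes in the high 32 bits and the payload in the low 32 bits, so a single atomic max updates both together. It assumes a reverse-Z convention (larger depth wins), which is the opposite of the comparison in the snippet above; with standard Z you would use a min instead. `pack_visibility` and `atomic_max` are illustrative names, and the compare-exchange loop stands in for a GPU 64-bit `atomicMax`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Packs depth into the high 32 bits so that comparing packed values compares depth first.
// Assumes depth >= 0, where IEEE-754 bit patterns sort in the same order as the floats.
uint64_t pack_visibility(float depth, uint32_t payload) {
    assert(depth >= 0.0f);
    uint32_t depth_bits = 0;
    std::memcpy(&depth_bits, &depth, sizeof(depth_bits));
    return (uint64_t{depth_bits} << 32) | payload;
}

// Emulates a 64-bit atomic max with a compare-exchange loop.
void atomic_max(std::atomic<uint64_t>& pixel, uint64_t value) {
    uint64_t seen = pixel.load(std::memory_order_relaxed);
    while (seen < value &&
           !pixel.compare_exchange_weak(seen, value, std::memory_order_relaxed)) {
        // A failed exchange refreshed `seen`; loop again only while we would still win.
    }
}
```

The depth and the ID can never disagree, because they live in the same word and are written by the same atomic operation.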
The only thing that's kinda meh with the old vertex shader approach is the preprocessing step
It takes a good third of the raster time*
Perhaps one could cache this
Reprocesses all vertices each frame?
Yeah, it builds an index buffer with meshlet_id << 7 | primitive_id & 0x7f
chonky
The shader's super easy as well
```glsl
shared uint base_index[MESHLET_PER_WORK_GROUP];
shared uint base_primitive[MESHLET_PER_WORK_GROUP];
shared uint primitive_count[MESHLET_PER_WORK_GROUP];

void main() {
    // 64 threads per meshlet, one thread per triangle
    const uint meshlet_base_id = gl_WorkGroupID.x * MESHLET_PER_WORK_GROUP;
    const uint meshlet_offset = gl_LocalInvocationID.x / 64;
    const uint meshlet_id = meshlet_base_id + meshlet_offset;
    const uint local_id = gl_LocalInvocationID.x % 64;
    const uint index = local_id * 3;
    // The first thread of each meshlet reserves space in the output index buffer
    if (meshlet_id < meshlet_count && local_id == 0) {
        const meshlet_t meshlet = meshlets[meshlet_id];
        base_index[meshlet_offset] = atomicAdd(o_command.index_count, meshlet.primitive_count * 3);
        base_primitive[meshlet_offset] = meshlet.base_primitive;
        primitive_count[meshlet_offset] = meshlet.primitive_count;
    }
    barrier();
    // Every other thread writes its triangle: meshlet ID in the high bits, local index in the low 8 bits
    if (meshlet_id < meshlet_count && local_id < primitive_count[meshlet_offset]) {
        o_meshlet.indices[base_index[meshlet_offset] + index + 0] = (meshlet_id << 8u) | (primitives[base_primitive[meshlet_offset] + index + 0] & 0xffu);
        o_meshlet.indices[base_index[meshlet_offset] + index + 1] = (meshlet_id << 8u) | (primitives[base_primitive[meshlet_offset] + index + 1] & 0xffu);
        o_meshlet.indices[base_index[meshlet_offset] + index + 2] = (meshlet_id << 8u) | (primitives[base_primitive[meshlet_offset] + index + 2] & 0xffu);
    }
}
```
```glsl
#define PRIMITIVE_MASK (MAX_PRIMITIVES - 1)
```
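A small sketch of the packing used above, written so the shift and the mask both come from one constant. The 7-bit split matches the `meshlet_id << 7 | primitive_id & 0x7f` message (the shader itself uses 8 bits), and `kPrimitiveBits` and the function names are made up here. The `MAX_PRIMITIVES - 1` mask only works when `MAX_PRIMITIVES` is a power of two.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPrimitiveBits = 7;                      // illustrative: 128 entries per meshlet
constexpr uint32_t kMaxPrimitives = 1u << kPrimitiveBits;   // must be a power of two
constexpr uint32_t kPrimitiveMask = kMaxPrimitives - 1;     // 0x7f

// `<<` and `&` both bind tighter than `|`, so the unparenthesised version in the chat
// parses the same way; the parentheses are only for readability.
constexpr uint32_t pack_index(uint32_t meshlet_id, uint32_t primitive_id) {
    return (meshlet_id << kPrimitiveBits) | (primitive_id & kPrimitiveMask);
}

constexpr uint32_t unpack_meshlet(uint32_t packed) { return packed >> kPrimitiveBits; }
constexpr uint32_t unpack_primitive(uint32_t packed) { return packed & kPrimitiveMask; }
```

With 7 bits for the local index, 25 bits remain for the meshlet ID in a 32-bit index, so this caps the scene at roughly 33 million meshlets per draw.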

not too bad
how many ms does that save?
that is why you should write about all that
and make it available for others outside our GP bubble
How does nanite do lods? For some reason I didn’t think they were anymore
the LODs are like the bread and butter technically speaking
It's clamplicated
Given a list of N clusters, they build a DAG where the leaves are the most detailed LODs
nanite lodding is like 66% of the point of nanite
virtualized geo makes it possible to draw so much
Yeah I guess when I hear lod I think chunky stuff from older games
its really complex, especially the hierarchy building is very complex and involved
nanite can dynamically LOD parts of objects even
Then for each LOD level until we reach the root, we take M clusters (let's say M=4 clusters) with the most shared boundaries and shrimplify them, the shrimplification happens such that the resulting cluster will be half the triangles of the previous cluster, after that we split the cluster into two clusters and they will be the parents
finally someone sees it
We simplify clusters in groups because otherwise we would have cracks where cluster boundaries don't match, so we try to group clusters with the most shared boundaries, and leave the outer boundaries of the group unchanged
This is repeated until we reach the root which is always MAX_PRIMITIVES triangles
How to choose the correct cut for the DAG is even more clamplicated, it's based on screen space projected error using a quadric error metric
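To get a feel for the shape of the DAG described above, here is a counting-only sketch: group M clusters, halve their triangles, and split the result back into M/2 full clusters. It ignores the actual grouping by shared boundary and the simplifier entirely, and how leftover clusters are handled is an assumption of this sketch. `lod_level_sizes` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of clusters at each LOD level, from the leaves down to a single root.
std::vector<std::size_t> lod_level_sizes(std::size_t leaf_clusters, std::size_t group_size = 4) {
    assert(group_size >= 2);
    std::vector<std::size_t> sizes;
    std::size_t count = leaf_clusters;
    while (count > 0) {
        sizes.push_back(count);
        if (count == 1) {
            break;  // reached the root cluster
        }
        const std::size_t full_groups = count / group_size;
        const std::size_t leftover = count % group_size;
        // A full group keeps half its triangles, so it refits into group_size / 2 clusters.
        // Leftover clusters form one smaller group that also halves (rounded up).
        count = full_groups * (group_size / 2) + (leftover + 1) / 2;
    }
    return sizes;
}
```

With M = 4 each level has about half the clusters of the one below it, so 64 leaf clusters give 7 levels in total.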
yes, which is very suboptimal 
But honestly the error thing flew so high over my head that I'm thinking of not implementing that 
Brian Karis himself said it took over one year of full development just to get the error metric right
Oh yeah 1 man year he said
its supposed to be simple to be implemented
otherwise they wouldntve said "simplify"
I was thinking of doing some good old shading
lighting, shadowing, real time SDF based global illumination
But for shadows I would need to call the whole Frognite™️ (patent pending) pipeline N times
So multiview?
Do I just dispatch more workgroups (for software)?
And I guess I could just use native multiview for mesh shaders
what does it do?
It expects the normal map to be transcoded to BC1_RGB
so it doesn't swizzle the green channel of the normal map to its alpha channel
can't wait for your release of 2gltf2pack
after you implement Frognite™️ in Frogfood®️ ofc
I do plan on releasing Frogmen™️ too
(lumen but with frogs)
SDFDDGI caught my eye
more like Lufrog
Froglight
Before I'll make ridiculous breaking changes
Here's some stupid N dot L * base_color shading 
zeux says "it doesn't take into consideration topology very much"
Which is exactly what I need 
I wonder how many curses I will get from Khronos for integrating DirectX tools with their pristine Vulkan™️ API
I also need to start thinking seriously about multiview
I am not sure I can use VK_KHR_multiview because I don't bind any color attachments whatsoever in my entire pipeline
(except for output to swapchain)
```c
typedef struct VkRenderPassMultiviewCreateInfo {
    VkStructureType    sType;
    const void*        pNext;
    uint32_t           subpassCount;
    const uint32_t*    pViewMasks;
    uint32_t           dependencyCount;
    const int32_t*     pViewOffsets;
    uint32_t           correlationMaskCount;
    const uint32_t*    pCorrelationMasks;
} VkRenderPassMultiviewCreateInfo;
```
I'm really not sure how to use this
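One piece of the struct above that is easy to pin down: each entry of `pViewMasks` is a bitfield in which bit i means "this subpass renders to view i", and `pCorrelationMasks` uses the same encoding to hint which views are spatially similar. A sketch of building such a mask follows; `view_mask_for` is an illustrative helper, not a Vulkan function.

```cpp
#include <cassert>
#include <cstdint>

// Builds a mask with the lowest `view_count` bits set, i.e. views 0 .. view_count - 1.
// The mask is a 32-bit field, so one subpass can address at most 32 views; the device
// limit maxMultiviewViewCount may be lower still.
constexpr uint32_t view_mask_for(uint32_t view_count) {
    return view_count >= 32 ? 0xffffffffu : (1u << view_count) - 1u;
}
```

For a single-subpass render pass drawing 6 cube faces, `subpassCount` would be 1 and `pViewMasks` would point at one value, `view_mask_for(6)`.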
i also noticed something funny
our pluginshinanians exposed quantization via gltf, but the plugin itself has some quantificitione options already
Does blender document their quantization shenanigans
gltfpack is open (blender is too but good luck navigating the 10 million loc black box of stuff)
i have not checked tbh
and its the gltf plugin i was talking about not blender itself
Ah it's a plugin
By the way
Some implementations may not support multiview in conjunction with mesh shaders, geometry shaders or tessellation shaders.
This is very sad 😭
fook
Why would you not support multiview with the feature that literally needs multiview more than any other feature 
I have no idea what multiview is
basically fancy gl_Layer
ah
if you make a framebuffer with layers, you can draw to each layer by writing to gl_Layer
multiview gives you gl_ViewIndex and it's read-only
yeah, rather than employing a gs
You set the number of views you want to render to when you set the thing up, and then all the draw commands are broadcast to each view
Nanite does this to render all shadow maps, for all lights, in all viewports simultaneously
oof that sounds neat
something i need too at some point for my shadowisms 🙂
or lightprobes perhaps?
Ah so that's how they do vsm
glMultiDispatch
It would be cool if we could schedule barriers from the GPU
True but I don't think I need barriers
Perhaps I could shrimply put the unused y dimension to use
or the hidden w dimension that driver devs don't want you to know about
The only issue I have is with memory
but it's only a couple million uints
so a couple megabytes at worst will become a couple hundred
With 100 multiviews
course I do it's extremely fast
HZB build + cull + classify doesn't even show up in gputrace right now
even for 100 views?
Combined is less than 50 microsecs
waw
Well uh
ah
For 100 views you could estimate 5 milliseconds just to cull
But 100 views is just a huge upperbound, I think unreal supports up to 128
But they cache their whole pipeline/scene and make it persist across frames and more shenanigans beyond my comprehension 
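The back-of-the-envelope numbers above, written out. This assumes both memory and cull time scale linearly with the view count, which is the pessimistic reading; the per-view figures (about a million uints, 50 microseconds) are taken loosely from the conversation, and `scale_per_view_cost` is a made-up helper.

```cpp
#include <cassert>
#include <cstdint>

struct ViewBudget {
    double megabytes;
    double milliseconds;
};

// Scales a per-view buffer size (in 32-bit uints) and a per-view GPU time to `view_count` views.
ViewBudget scale_per_view_cost(uint64_t uints_per_view, double microseconds_per_view,
                               uint32_t view_count) {
    const double bytes = static_cast<double>(uints_per_view) * sizeof(uint32_t) * view_count;
    return {bytes / 1.0e6, microseconds_per_view * view_count / 1.0e3};
}
```

One million uints per view is 4 MB, so 100 views land at 400 MB and 5 ms, which matches the "couple hundred megabytes" and "5 milliseconds just to cull" estimates.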
ouf
You really should fork gltfpack. Most of the processing is in meshoptimizer, gltfpack is essentially almost "just" a wrapper for meshoptimizer.
Or even file issues / pull requests
TODO
Test MaterialID depth buffering:
- Rasterize visbuffer
- Fullscreen triangle, load visbuffer per pixel and write gl_FragDepth = uintBitsToFloat(material_id);, depth test set to ALWAYS
- Draw more fullscreen triangles, one per material, depth test set to EQUAL
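A sketch of one way to encode the material ID for the depth-equal trick in the TODO above. Note this swaps in a different encoding from the TODO: bit-casting a small integer to float yields a denormal value, which may not survive an exact EQUAL test on every GPU, so this version divides by a power of two instead. `kMaxMaterials` and both function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxMaterials = 1u << 16;  // illustrative upper bound, must be a power of two

// Maps a material ID to a depth in [0, 1). Dividing by a power of two is exact in
// floating point, so the per-material pass can compute the same value and test EQUAL.
float material_depth(uint32_t material_id) {
    assert(material_id < kMaxMaterials);
    return static_cast<float>(material_id) / static_cast<float>(kMaxMaterials);
}

// Inverse mapping, mostly useful for debugging a captured depth buffer.
uint32_t material_from_depth(float depth) {
    return static_cast<uint32_t>(depth * static_cast<float>(kMaxMaterials));
}
```

Each per-material fullscreen triangle would then be emitted at `material_depth(id)` so that only pixels tagged with that material pass the test.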
not sure if you are into bundles, but https://www.youtube.com/watch?v=6_BBgz5-H20
The Unreal Engine Mega Pack is a huge collection of high quality 3d environment assets.
https://www.humblebundle.com/software/unreal-engine-mega-pack-software?partner=gamefromscratch
This pack from Hivemind contains thousands of 3D objects and blueprints for creating a wide variety of maps, including Viking, Medieval, Harbours, Churches, Houses...
for the triangle counts 🙂
heh
twas a success pog
@frank sail
I drew a lil something
Feast your eyes upon my huge drawing skills
This is with regards to SMRT
As far as I've understood at least
reads like al hamdu lillah
heh, you haven't seen my handwriting
white text is "Sun"
blue text is "Hit"
green text is "March until hit"
ah lol
this diagram works for any ray traced shadow tbh
How does this produce contact hardening though? 
Also you seem to define something like a "heightmap thickness" which I have no idea where it comes in
o I think I got it
the farther away we are, the less likely we are to hit any object?
Farther away means the ground-to-light ray has more time to diverge before hitting a blocker
Idk if you've played counter strike, but you might know in shooter games that peeking when you're near a corner will reveal the enemy more quickly than peeking from far away
well that's just a heuristic
the depth map doesn't tell us the true geometry of everything behind it, so we have to guess somehow
so we just say the depth map is a solid wall of N thickness with absolutely nothing behind it (except for the surface we're shading)
So instead of the tree we have a huge wall
unreal uses some additional heuristics that are suggested by console commands
From the perspective of the ray at least
yeah, it's a wall with the outline of a tree
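The thickness heuristic above can be sketched on a one-dimensional shadow map. This is only my reading of the discussion, not Unreal's SMRT: the ray is marched in shadow-map space, and a sample counts as blocked when its depth falls inside the slab `[blocker, blocker + thickness]`. `ShadowRay` and `march_shadow_ray` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ShadowRay {
    float u;       // position across the shadow map, in [0, 1)
    float depth;   // distance from the light, same units as the map
    float du;      // per-step change in u
    float ddepth;  // per-step change in depth (negative when moving toward the light)
};

// Marches the ray through a 1D depth map and reports whether it hits a blocker.
// Every texel is treated as a wall that starts at the stored depth and is `thickness` deep.
bool march_shadow_ray(const std::vector<float>& depth_map, ShadowRay ray, int steps,
                      float thickness) {
    assert(!depth_map.empty());
    for (int i = 0; i < steps; ++i) {
        ray.u += ray.du;
        ray.depth += ray.ddepth;
        if (ray.u < 0.0f || ray.u >= 1.0f) {
            return false;  // left this shadow map without finding a blocker
        }
        const std::size_t texel = std::min(
            depth_map.size() - 1,
            static_cast<std::size_t>(ray.u * static_cast<float>(depth_map.size())));
        const float blocker = depth_map[texel];
        if (ray.depth >= blocker && ray.depth <= blocker + thickness) {
            return true;  // inside the assumed wall
        }
    }
    return false;
}
```

A thickness that is too small lets the ray slip behind the blocker between samples, which is one way the light leaking mentioned later can show up.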
I'm testing SMRT in unreal right now lol
This is SMRT with 1spp and 1rpp
pretty blocky
btw the UE docs don't cover the new SMRT console commands
they remained the same since the 5.0 release
Now the obligatory question
What are the disadvantages of SMRT?
To me it looks like free contact hardening lol
Also what happens when the blocker is not in this cascade 
it has light leaking
idk 
you ought to be able to trace within multiple cascades I guess
damn
Tracing within multiple cascades sounds baad
Let's say we got 8 shadow rays
8 steps per cascade
On average we can assume the blocker has a 50% probability of being in the current cascade
what I mean is just switching to a different cascade when you go outside the bounds of one
hm
the shadow ray is typically pretty short (even when the blocker is far away, you just terminate the ray), so it's unlikely you'll ever go between more than two
at least according to my mental heuristics (I haven't actually implemented this with >1 cascade)
How do I know when I'm out of bounds though 
shadow_clip_pos.xy >= 1.0 or <= 0.0?
Or maybe when the ray is above the shadow map?
yeah I guess
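A minimal sketch of the bounds test discussed above, assuming the ray position has already been projected into each cascade's shadow-map UV space in [0, 1]. The helper names are made up, and "finest cascade first" ordering is an assumption.

```cpp
#include <cassert>
#include <vector>

struct Vec2 {
    float x, y;
};

// True when the projected position still lands on the shadow map.
bool inside_shadow_map(Vec2 uv) {
    return uv.x >= 0.0f && uv.x <= 1.0f && uv.y >= 0.0f && uv.y <= 1.0f;
}

// Given the same point projected into every cascade (finest first), returns the index
// of the first cascade that contains it, or -1 if none does.
int first_cascade_containing(const std::vector<Vec2>& uv_per_cascade) {
    for (int i = 0; i < static_cast<int>(uv_per_cascade.size()); ++i) {
        if (inside_shadow_map(uv_per_cascade[static_cast<std::size_t>(i)])) {
            return i;
        }
    }
    return -1;
}
```

During the march you would re-run this whenever the current cascade's test fails and continue sampling from the returned cascade.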
btw do you do entity culling on the gpu?
manual barriers
isnt that just culling meshlets?
not full entities
im looking at cull_classify.comp
Ah right now it's not the latest version
But yes, I am culling meshlets
meshlet instances
but not entities before that
What do you mean by entity?
i was gonna ask how if you did
because its actually annoying as fuck
because you have an asymetric work expansion from mesh to meshlets
each mesh can return a different count of meshlets
so you need to either do an indirect draw count, where you cull in the vertex shader and discard all vertices, starting MESH_COUNT draws each having an indirect draw with PER_MESH_MESHLET_COUNT as the number of vertices
or you do compute work expansion
by doing a prefix sum and then binary search
but soon tm we may get work graphs which solve this issue btw
with them you would just dispatch meshlet culls from the mesh culls
making it ultra simple and efficient
ALSO WHY IS THERE NO DISPATCHINDIRECTCOUNT
😿
@frank sail give it to me
Wat
i need multi dispatch indirect
??
why do you want that
for the same reason i would want draw indirect count
just increase the size of the dispatch
no
unless you also want indirect global barriers and such
if we get work graphs i wont care at all i wont use any indirect anymore only work graphs if they dont suck
ok how do i map the indices
prefix sum and binary search is the most efficient way aside from draw indirect count abuse
idk what problem you're trying to solve so idk
i need to map from global thread index to meshlet index and mesh index
each mesh not culled has different counts of meshlets
i need to iterate over the surviving meshes meshlets
so i cant just use the thread index to index meshlets
can you explain how you would use multi dispatch indirect
and how that is not mappable to regular indirect dispatch
if you need ordering, then you're basically asking for indirect barriers and command submission, which I agree would be cool and useful
- i would make a buffer containing n dispatch indirect structs
- each containing the meshlet count / workgroupsize rounded up as the x parameter, 1,1 for y and z
- i would populate this in mesh culling, each surviving mesh appends to this buffer filling the dispatch values
- then dispatch indirect count over the array of dispatch infos, each working to cull meshlets for a mesh
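The list above can be emulated on the CPU to show what the buffer would contain. `DispatchArgs` has the same three-uint layout as `VkDispatchIndirectCommand`; the workgroup size of 64 and the function name are assumptions, and the final "dispatch indirect count" call that would consume the array in one go is exactly the part that does not exist for compute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DispatchArgs {
    uint32_t x, y, z;  // same layout as VkDispatchIndirectCommand
};

constexpr uint32_t kWorkgroupSize = 64;  // illustrative

// Stands in for the mesh-cull pass: every surviving mesh appends one dispatch
// sized to cover its meshlets, rounded up to whole workgroups.
std::vector<DispatchArgs> build_meshlet_cull_dispatches(
    const std::vector<uint32_t>& meshlet_counts, const std::vector<bool>& mesh_visible) {
    assert(meshlet_counts.size() == mesh_visible.size());
    std::vector<DispatchArgs> dispatches;
    for (std::size_t i = 0; i < meshlet_counts.size(); ++i) {
        if (!mesh_visible[i]) {
            continue;  // culled meshes produce no work at all
        }
        const uint32_t groups = (meshlet_counts[i] + kWorkgroupSize - 1) / kWorkgroupSize;
        dispatches.push_back({groups, 1, 1});
    }
    return dispatches;
}
```

On the GPU the append would be an atomic counter increment, and that counter is what a hypothetical count-based dispatch would read.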
this can be done with task shaders as well btw
vkCmdDrawMeshTasksIndirectCountEXT or whatever
but it has the stupit shit with setting up drawing and all that for no reason
how would you map global thread index to meshlet and mesh index
if you have n meshes
each have m[N] meshlets
this is why we have draw indirect count
You could probably have a buffer with the counts of each meshlet and divide the global ID by some upper bound
what if you did an indirect dispatch where you use Y or Z to indicate how many meshlets you produced or whatever
Not sure if this would work
wildly different sizes
its super slow like that
so you would get like 95% of the grid wasted
one way to fix this
is to simply not be GPU driven
Unacceptable
and record a dispatch per mesh, and then use predicates or whatever they are called
to cull
but that is actually much slower
as you need an extra dispatch for all draws
I'm sure the issue is trivially solvable with another level of indirection
for each nonculled mesh
then i binary search for each thread in a fat dispatch
if they find two counts they are in between
Each thread does binsearch?
they found their mesh
and can then subtract that mesh's prefix sum from their id
to get meshlet id
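The prefix-sum-plus-binary-search mapping described above, as a CPU sketch. On the GPU the offsets would live in a buffer and `find_meshlet` would run once per thread of the fat dispatch; the names here are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MeshletRef {
    uint32_t mesh;     // index into the list of surviving meshes
    uint32_t meshlet;  // meshlet index local to that mesh
};

// offsets[i] is the first global thread ID owned by mesh i; offsets.back() is the total.
std::vector<uint32_t> exclusive_prefix_sum(const std::vector<uint32_t>& meshlet_counts) {
    std::vector<uint32_t> offsets(meshlet_counts.size() + 1, 0);
    for (std::size_t i = 0; i < meshlet_counts.size(); ++i) {
        offsets[i + 1] = offsets[i] + meshlet_counts[i];
    }
    return offsets;
}

// Binary search for the two offsets the thread ID sits between, then subtract
// that mesh's offset to get the local meshlet index.
MeshletRef find_meshlet(const std::vector<uint32_t>& offsets, uint32_t thread_id) {
    assert(!offsets.empty() && thread_id < offsets.back());
    const auto next = std::upper_bound(offsets.begin(), offsets.end(), thread_id);
    const uint32_t mesh = static_cast<uint32_t>(next - offsets.begin()) - 1;
    return {mesh, thread_id - offsets[mesh]};
}
```

The search costs about log2(mesh count) loads per thread, which is why it stays cheap for modest entity counts. Using `upper_bound` also skips meshes with zero meshlets, since their offsets are duplicates.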
aha, that's genius
🤓 👆
But isn't binsearch for each thread slow as fucc
i had to ponder the orb for that one
ultra fast
at least for small entity counts
another solution would be to have different expansion rates
so inside the mesh cull shader you do
- test mesh
- if meshlet count < 128 append to 128 buffer
- if meshlet count < 512 append to 512 buffer
- ...
then later, for each of these buffers, you dispatch with the number of entries in x, and y is the multiplier that takes you from the workgroup size to that buffer's meshlet count
This sucks
this will probably waste around 70% worst case or so
I like binsearch after all 
if you do it power of two steps, it will be at most 50% ignoring anything under 32 or what ever warp size is
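A sketch of the power-of-two bucketing and the waste bound claimed above. The minimum bucket of 32 follows the "32 or whatever warp size is" remark; `bucket_capacity` and `wasted_fraction` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMinBucket = 32;  // illustrative, roughly one warp

// Smallest power-of-two bucket (at least kMinBucket) that fits the meshlet count.
uint32_t bucket_capacity(uint32_t meshlet_count) {
    assert(meshlet_count <= (1u << 31));
    uint32_t capacity = kMinBucket;
    while (capacity < meshlet_count) {
        capacity <<= 1;
    }
    return capacity;
}

// Fraction of the threads launched for this mesh that find no meshlet to process.
double wasted_fraction(uint32_t meshlet_count) {
    const uint32_t capacity = bucket_capacity(meshlet_count);
    return static_cast<double>(capacity - meshlet_count) / static_cast<double>(capacity);
}
```

Any count above the minimum bucket lands in a bucket it fills more than halfway, so the waste stays under 50%; the worst case is a count of one more than a power of two.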
i believe it is the best way to combine them
so have an if on massive meshes
like idk > 1024 meshlets
Ye prolly the best
and put them in the buffer for big dispatch or so
but its so much work and so stupit
GIVE ME DISPATCH INDIRECT COUUUUNNNTTTT
it is actually more efficient to do a draw indirect count and cull in vertex shaders im pretty sure
just to not do all the shit inbetween
I really fail to understand why there is no indirect count for dispatch
such a basic thing
because nobody needed it 
Well mr potrick needs it now (and I will be as well in the near future)

So we'll be raiding Khronos HQ
whip out the copium bois
i believe it would be very easy to implement in my naive world view
if we get workgraphs this is all not important
they are indirect on all steroids at the same time
Workgraphs would solve so many issues with GPU driven it's crazy
yes
did someone already mention doing a bunch of DispatchIndirect (up to a fixed max) on the CPU and letting the GPU populate each one
truly one of the GPU-driven strategies of all time
yep
.
i think its actually how it should be done if i wasnt full gpu driven
why do you need predication
i heard from the mountains that some vendors like it over 0 dispatches
The olympus gods
can't you treat it like MultiDispatchIndirect from the GPU side (no count)
i guess 0 is fine
would be cool to loop
omg give me the command processor
i will programm it
😟
abandon vulkan and become an amdgpu main
oh, so you want to replace fixed-function bits of the hw? 
While you're at it expose the whole memory subsystem, so I won't need CPU readback to update gpu mem pages 
ok ok listen to me:
i prerecord a command buffer with 1 million dispatches or some other high number, then have predication around every 100 or so.
Then i fill them as i need, enabling predicates to unlock more dispatches.
I then reuse that cmd buffer every frame
i use my iron
the gpu already has virtual memory and can load pages from cpu ram
ok I thought you were talking about query objects
TODO: look at SDF tracing and probe tracing
Voxels scare me
Actually any kind of data that is supposed to be stored in a data structure that's not a simple ass array scares me 
@wicked notch I haven't done SDF tracing yet
for probe tracing are you talking about
its about time you do 🙂
is dis yours
yeah I think SDF is used for things like Godot 4
I believe the data structure is baked though
oh dang so they're not redoing it
probes tend to be baked though
I think
from what I remember for things like Division 2
I think their probes end up using sort of offline preprocessing so that each one can cache which surfaces they can see
oui
luschtri asked to have dfdx(worldpos) vischuellized
It looks super kewl
yeah
disco bounding lines
im surprised you dont try to sell me "you need fsr2"
: >
I was doing an experimentationes with dFdx
You can transfer the knowledge I gained to frogfooding btw
```glsl
mat3 TBN = mat3(0.0);
{
    const vec3[] world_positions = vec3[](
        vec3(transform * vec4(positions[0], 1.0)),
        vec3(transform * vec4(positions[1], 1.0)),
        vec3(transform * vec4(positions[2], 1.0))
    );
    const vec3 ddx_position = analytical_ddx(derivatives, world_positions);
    const vec3 ddy_position = analytical_ddy(derivatives, world_positions);
    const vec2 ddx_uv = uv_grad.ddx;
    const vec2 ddy_uv = uv_grad.ddy;

    const vec3 N = w_normal;
    const vec3 T = normalize(ddx_position * ddy_uv.y - ddy_position * ddx_uv.y);
    const vec3 B = -normalize(cross(N, T));
    TBN = mat3(T, B, N);
}
```
Here's how I do TBN now
No tangents required
```glsl
vec3 analytical_ddx(in partial_derivatives_t derivatives, in vec3[3] values) {
    return vec3(
        dot(derivatives.ddx, vec3(values[0].x, values[1].x, values[2].x)),
        dot(derivatives.ddx, vec3(values[0].y, values[1].y, values[2].y)),
        dot(derivatives.ddx, vec3(values[0].z, values[1].z, values[2].z))
    );
}

vec3 analytical_ddy(in partial_derivatives_t derivatives, in vec3[3] values) {
    return vec3(
        dot(derivatives.ddy, vec3(values[0].x, values[1].x, values[2].x)),
        dot(derivatives.ddy, vec3(values[0].y, values[1].y, values[2].y)),
        dot(derivatives.ddy, vec3(values[0].z, values[1].z, values[2].z))
    );
}
```
With this
is that from how to reconstruct normals out of thin air?
No but that's the next step 
heh
I could actually calculate normals analytically right now
i rember there was a blog flying around wrt to that, recently
It's just normalize(cross(v[2] - v[0], v[1] - v[0]))
cheeky
So the vertex data becomes just position and UV
And we can quantize both of them perfectly 
Road to 0 byte vertex format
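The flat-normal formula above, written out as a CPU sketch with a tiny hand-rolled vector type (all names here are illustrative). It follows the operand order from the message, `cross(v[2] - v[0], v[1] - v[0])`; whether that points outward depends on the winding convention of the mesh.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 {
    float x, y, z;
};

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

Vec3 normalize(Vec3 v) {
    const float length = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(length > 0.0f);  // degenerate triangles have no normal
    return {v.x / length, v.y / length, v.z / length};
}

// Per-face normal from the three corner positions only; no stored normals needed.
Vec3 face_normal(Vec3 v0, Vec3 v1, Vec3 v2) {
    return normalize(cross(sub(v2, v0), sub(v1, v0)));
}
```

Every pixel of a triangle gets the same normal this way, which is the loss of smooth shading the following messages joke about.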
dang that's pretty cool
hehe
rip smooth norballs
this is
2.0
perhaps you can smear a little dithering over it, nobody will notice non smoof norbels
just hardcode the scene into your shader to make it free
powerplant.obj.vs.glsl
I think you mean, tessellate the mesh so much that multiple pixels do not share a triangle
that's how you get smooth face balls
That's the objective with Nanite anyways 🚠
Eckszacktly
I wonder if I could compute a gradient for smoothizing the normals
You have to get the neighboring faces too
we can't use subgroup ops in frag shaders right? 😦
Which means you pass a half edge structure to the GPU 
There are certain subgroup ops that you can use
Like the quad ones
I think that's it
Is there no WaveReadAcrossQuadLaneX or something like that
yeah there is
ctrl f quad
subgroupQuadBroadcast what a shit name

I thought this was a "write" operation, not a read one
(facepalming at the name, not you)
It actually broadcasts an FM radio signal when you call it
so you can tune in?
Ye so we can look at dem quads
I prefer triangles
did you end up figuring out the dFdx thingy
yes
see here for results
ridiculous 

400 microseconds in total for bistro with sampling and all
At this point I should begin optimizing the memory bandwidth of the GPU because that's my bottleneck 
Multiview is coming soon
You'll port that to #1128020727380054046 right 
multiview ain't supported on GL 😭
you'll port all this to #1073361699651989584 right
It's all open sus
Isn't there an ext or am I tripping
nono i need more users
Though I have no idea if any other vendor silently supports this ext 
Holy shit it's supported
Wdym
Incredible
brainworm 3.0 unlocked
You know what to do now
there is a OVR_multiview2 too
OVR = OpenVR
Yes, study for my upcoming exam 
i difuger
You go ahead and implement all the PBRisms
you can study tomorrow evening
bad parenting deccer 
The basic idea is that when light hits a surface it bounces off in certain ways, for example diffuse. At some point someone realized we can approximate that by spawning point lights on or near surfaces where light directly hits it
Then single or multi bounce lighting just becomes spawning virtual lights around the scene
Which I think can somewhat be related to probe based lighting too
Hmm
For that instead of VPLs they spawn probes
And probes capture light info for regions of the world
that seems pretty neat
re ovr_multiview, just found that while looking for the txt file https://forums.developer.nvidia.com/t/gl-ovr-multiview-performance-on-rtx3000/184313
The gains are from avoiding expensive barriers and state changes
If you are doing basic things with no barriers multiview is unlikely to make a diff
typical
akshually in this case it's oculus vr 🤓 ☝️
unfortunately both have been shortened to ovr
pranked
tbf that application is basically a worst case for multiview
going from singleview to multiview is only going from 2 drawcalls to 1, so there's barely any reduction in cpu overhead, and the points only have a single colour attribute that is passed straight through in the vs so there's no vertex work it can share between the two views
Does it work?
I am considering getting it but don't want to waste money 
success
It's not a very high poly scene though, just 20 million instanced triangles
what does instanced mean in this context?
are those wall towers the same mesh and those have been instanced?
yes, the trees too
It's also just 34MB lol
it doesn't have any textures sadly
Perhaps Unreal's FBX exporter is unable to export textures?
triangle dust 
looks ok to me
are you sure you loaded them not as srgb 😛 (i know there are no maps yet)
uhm
this mesh turns my "modelviewer" black and imgui wont show up either 😄
4:54:54
i doubt i can even take a capture 😛
Perhaps you could display the instance ID or the primitive ID
here's the color func I use
```glsl
// saturate(x) is assumed to be a clamp(x, 0.0, 1.0) helper defined elsewhere
vec3 hsv_to_rgb(in vec3 hsv) {
    const vec3 rgb = saturate(abs(mod(hsv.x * 6.0 + vec3(0.0, 4.0, 2.0), 6.0) - 3.0) - 1.0);
    return hsv.z * mix(vec3(1.0), rgb, hsv.y);
}
```
You use it like this
```glsl
hsv_to_rgb(vec3(float(gl_PrimitiveID) * M_GOLDEN_CONJ, 0.875, 0.85))
```
because my code is shit presumably
incredible
```
[16:55:33 DBG] SharpGltfMeshLoader: Loading Material MI_Fountain_Water_Inst
[17:00:08 DBG] SharpGltfMeshLoader: Loaded 46961 primitives from /home/deccer/Personal/Code/Projects/lessGravity/OpenSpace/src/OpenSpace.Main/bin/Debug/net7.0/Data/Props/bazaar.glb
```
4.5min 😄
ok this is weird, it has finished loading everything, but screen is black XD
wtf
its busy creating the meshpool out of those 43k meshprims 😄
ok i have to work on that hehe
Ah you don't handle instancing 
Ah no need
it's not using EXT_mesh_instancing
It's just using regular gltf node instancing
ah
where multiple nodes reference the same mesh
i just iterate over all nodes
and meshes should be handled properly, i believe my deccer cubes work the same way
thanks for this fucked model : > to show me how shit my code is
na, my code is also actually shit
too much memory copy bs and allocation shinanigans
ugh
when unreal engine
Yes, but I constantly bluescreened 
MEMORY_MANAGEMENT or some stuff
Turns out my CPU's IMC did NOT like 128GB
smh Jaker
fix your CPUs
fook
Could be a timing issue
Might need to decimate?
why did you buy 64 gb of ram before checking if your cpu can handle it 
TODO
Real Time Ray Tracing of Micro Poly Geometry with Hierarchical Level of Detail
Carsten Benthin, Christoph Peters
Paper Session: Primitives, Surfaces & Appearance Modeling - HPG 2023 - Day 2
learn how the fuck they managed this
Hold on

They just went: "alright no available API allows us to do efficient BVH rebuild what do we do"
"we obviously forgo hardware acceleration and just use Embree and make our own BVH!"
amazing
If they can make a GPU accelerated, dense BVH based on clusters
Why can't AMD or NVIDIA
bruh
QVL only looks at entire kits
which is why I meant could be a timing issue
Alright boys
it is time for one of my usual detours
Like last time with mesh shaders, we can all see it didn't turn out into anything serious
It's not like I'm in a rabbit hole 9km deep into meshlets
Totally not that
Anyways this time the detour will be RayTracing!
After the last exam is done, I will spend day and night learning about BVHs on the GPU (doing them myself, no VK_KHR_ray_tracing)
Then I will re-read the paper about nanite style RT LODs
and finally I will try making an issue on Vulkan-Docs to see how the big brains over at Khronos will receive it
the driver sink hole you will fall in gives me enough time to catch up again
i can draw again btw
i am slowly clawing back my power in the rewrite
I saw your impl of the entity culling btw, I think I get it now
Amazing ideas behind it
You guys remember the ballz
It's time to rewrite the raytracer, on the CPU with a proper BVH this time 
Nice balls homie, solid 8/10
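```cpp
// Pseudocode: follow one path through the scene, bouncing until it escapes.
auto trace(bvh, origin, direction) -> color {
    auto ray = { origin, direction };
    const auto max_bounces = 32;
    for (auto i = 0; i < max_bounces; ++i) {
        auto hit = bvh.traverse(ray);
        if (!hit) {
            break; // the ray left the scene
        }
        // Continue from the hit point in a random direction above the surface.
        ray.origin = hit.point;
        ray.direction = random_direction_in_hemisphere(hit.normal);
    }
    // No shading yet: nothing is accumulated or returned.
}
```
hmm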
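For reference, a minimal sketch of what `random_direction_in_hemisphere` from the pseudocode above could look like. The `Vec3` type and the extra `rng` parameter are assumptions made for illustration, not the code actually used here. It samples uniformly, so a cosine-weighted variant would be a separate change:

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for a real vector type

inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Uniformly samples a unit direction in the hemisphere around `normal`.
template <typename Rng>
Vec3 random_direction_in_hemisphere(Vec3 normal, Rng& rng) {
    assert(std::fabs(dot(normal, normal) - 1.0f) < 1e-3f);  // expects a unit normal
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    Vec3 d{};
    float len2 = 0.0f;
    // Normalizing three independent Gaussians gives a uniform direction on the sphere.
    do {
        d = { gauss(rng), gauss(rng), gauss(rng) };
        len2 = dot(d, d);
    } while (len2 < 1e-12f);
    const float inv_len = 1.0f / std::sqrt(len2);
    d = { d.x * inv_len, d.y * inv_len, d.z * inv_len };
    // Mirror directions that point below the surface into the upper hemisphere.
    if (dot(d, normal) < 0.0f) {
        d = { -d.x, -d.y, -d.z };
    }
    return d;
}
```

Each thread would need its own `rng` instance, since the standard engines are not safe to share without locking.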
deep thought
does each primitive need to store an ID of the mesh it belongs to?
So each BVH node will contain the ID of the primitive and the ID of the mesh?
the colors are nice, how did you pick them
I just went on the usual adobe color picker and chose something that looked nice
looks very vibrant
Also I wanted to mimic Sebastian Lague's layout so that I had a good reference image
probably not
hmm
you could also ask pixelduck about the BVH format on AMD
my dementia only allows me to remember a small number of things at a time
We're probably sharing braincells because I have the same issue
btw on AMD, an ID (often used for materials) is stored in triangle nodes
and each triangle node can store up to four triangles arranged as a fan (so only five total positions have to be stored)
Damn BVHs are heavy
The library I'm using stores bounding box (24 bytes) and 2 indices (8 bytes)
box nodes hold "pointers" (indices) to four children as well as their bounds, stored as f16 or f32 (so there are two kinds of box nodes)
Overall I can see how to send this thing to the GPU though
it's basically a flat tree stored as a vector
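A rough sketch of that layout, assuming the two indices mean "first child or first primitive" and "primitive count". That interpretation and the names `FlatNode`, `FlatBvh` and `upload_size_bytes` are guesses for illustration, not the library's actual definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 32-byte node: 24 bytes of bounds plus two 32-bit indices.
struct FlatNode {
    float bounds_min[3];
    float bounds_max[3];
    std::uint32_t first;  // inner node: index of the left child (right child is first + 1)
                          // leaf: index of the first primitive
    std::uint32_t count;  // 0 for inner nodes, primitive count for leaves

    bool is_leaf() const { return count != 0; }
};

static_assert(sizeof(FlatNode) == 32, "24 bytes of bounds + 8 bytes of indices");

// The whole tree is one contiguous vector with the root at index 0,
// so it can be copied into a GPU buffer in a single upload.
using FlatBvh = std::vector<FlatNode>;

inline std::size_t upload_size_bytes(const FlatBvh& bvh) {
    return bvh.size() * sizeof(FlatNode);
}
```

Because children are addressed by index rather than by pointer, the same bytes stay valid after being copied to the GPU.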
so I guess you're not doing hw rt
Not yet™️
I first have to understand the basics before I can trust the hardware to do it right 
tru
for more info, check the RDNA 2/3 ISA guides and search for IMAGE_BVH_INTERSECT_RAY
well it might not be that much more info
@wicked notch sir web wizard
<div style="display:flex;align-items:center;">
<input
type="checkbox"
id="{{component.id}}-checkbox-{{option}}"
name="{{component.id}}"
value="{{option}}"
[checked]="component.val.includes(';{{option}};')"
(change)="onCheckboxChange($event)"
style="width: 10%; height: 30px;">
<label for="{{component.id}}-checkbox-{{option}}" style="">{{option}}</label>
</div>
why is the label nicely centered but the checkbox isn't
ok bruh the input was inheriting some css that messed it up
thanks previous devs
LVSTRI be working in the secret 25th hour each day



Alright for now I'll shrimply store another indirection vector
purpose is mesh_id = ind[prim_id]
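A small sketch of how such an indirection vector could be built, assuming primitives are laid out mesh by mesh and `prim_id` is the original (pre-BVH-permutation) primitive index. `build_prim_to_mesh` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds ind such that ind[prim_id] == mesh_id, given how many
// primitives each mesh contributes, in mesh order.
inline std::vector<std::uint32_t> build_prim_to_mesh(
        const std::vector<std::uint32_t>& prims_per_mesh) {
    std::vector<std::uint32_t> ind;
    for (std::uint32_t mesh_id = 0; mesh_id < prims_per_mesh.size(); ++mesh_id) {
        // Append `count` copies of this mesh's id.
        const auto count = static_cast<std::size_t>(prims_per_mesh[mesh_id]);
        ind.insert(ind.end(), count, mesh_id);
    }
    return ind;
}
```

This costs 4 bytes per primitive. With one BLAS per mesh, the lookup becomes unnecessary, because the mesh index is known while traversing that BLAS.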
I see now why we have two BVHs 
you have a blas for each mesh right?
Ye that's the plan at least
so when iterating through the blases and traversing each one, can't you remember the index, just like you remember the closest hit pos?
Perhaps that would be best
Also my "mesh" right now is a single triangle 
I'll just try tracing this tringle for now
soon lvstri will be snatched by some big $GPUVENDOR where he is put in the basement to work on $TECH and we will never see/hear/read from him anymore
I've lost count of how many times I had to draw a tringle
But here we are again
with bonus barycentric coordinates
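For anyone following along, a sketch of the classic Möller-Trumbore ray/triangle test, which yields the barycentric coordinates as a by-product. This is the textbook algorithm, not necessarily what is used here, and `Vec3`, `TriangleHit` and `intersect_triangle` are illustrative names:

```cpp
#include <cassert>
#include <cmath>
#include <optional>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Hit point = (1 - u - v) * p0 + u * p1 + v * p2 = origin + t * dir.
struct TriangleHit { float t, u, v; };

// Moller-Trumbore ray/triangle test (no backface culling).
inline std::optional<TriangleHit> intersect_triangle(
        Vec3 origin, Vec3 dir, Vec3 p0, Vec3 p1, Vec3 p2) {
    const Vec3 e1 = p1 - p0;
    const Vec3 e2 = p2 - p0;
    const Vec3 pvec = cross(dir, e2);
    const float det = dot(e1, pvec);
    if (std::fabs(det) < 1e-8f) return std::nullopt;  // ray parallel to the triangle
    const float inv_det = 1.0f / det;
    const Vec3 tvec = origin - p0;
    const float u = dot(tvec, pvec) * inv_det;
    if (u < 0.0f || u > 1.0f) return std::nullopt;
    const Vec3 qvec = cross(tvec, e1);
    const float v = dot(dir, qvec) * inv_det;
    if (v < 0.0f || u + v > 1.0f) return std::nullopt;
    const float t = dot(e2, qvec) * inv_det;
    if (t <= 0.0f) return std::nullopt;  // triangle is behind the ray origin
    return TriangleHit{t, u, v};
}
```

Writing `(u, v, 1 - u - v)` straight into RGB gives the usual rainbow triangle.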
I decided that everything shall be world space for simplicity
is it srgb though?
now it is
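The linear to sRGB step is just the standard piecewise transfer function applied per color channel. A minimal sketch, with `linear_to_srgb` as an illustrative name rather than the function used here:

```cpp
#include <cassert>
#include <cmath>

// Linear -> sRGB transfer function for one channel, input clamped to [0, 1].
inline float linear_to_srgb(float c) {
    c = std::fmin(std::fmax(c, 0.0f), 1.0f);
    return c <= 0.0031308f
        ? 12.92f * c                                   // linear toe near black
        : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;  // power curve elsewhere
}
```

Alpha is left untouched. Only the color channels are encoded.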
put it in a shader
Parallel.ForEach(trasversals, trasversal => {});
i should be quiet, i cant do any gp really : >
also, wdym "parallelize traversal"? like you want one ray's traversal to be parallelized?
Perhaps it's just this library I'm using, but their BVH traversal function isn't really thread safe
it does execute in parallel internally I think though
intersect() isn't marked const
so that could be why
it modifies internal state or something
what library is this
madmann is here btw
actually hold on
I'm dumb
yeah I'm dumb
the ray required for traversal isn't const
but intersect is
what does the function return?
intersect? Nothing
actually can you just tell me what file it's in
it takes a function that is supposed to iterate over some primitives and intersect each one with the ray
ok so I guess the ray is mutable in case one of the callbacks needs to mutate it
rays also store tmin and tmax
so it should be perfectly fine to call that fn from many threads
yep
I am invoking UB
I wonder why I didn't doubt myself before doubting the lib
smh
Also fun fact, none of the internal usages of ray change its state apparently
Perhaps I'm missing something?
czech the exshrimples
oh ok so tmin and tmax are actually the min and max distance for the ray to travel
the example has this line which tells me that perhaps tmin and tmax get modified during traversal
https://github.com/madmann91/bvh/blob/master/test/simple_example.cpp#L98
so I guess it's sorta an inout ray param
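To illustrate the "inout ray" idea: a toy sketch where each successful primitive test shrinks `tmax`, so farther primitives get rejected and the last accepted hit is the closest one. `Ray`, `PlaneZ` and `closest_hit` are made-up types for this example, not the library's:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Ray {
    float origin[3];
    float dir[3];
    float tmin = 0.0f;
    float tmax = std::numeric_limits<float>::max();
};

// Hypothetical primitive: an infinite plane z = height.
struct PlaneZ {
    float height;
    // Accepts a hit only inside [tmin, tmax] and shrinks tmax on success.
    bool intersect(Ray& ray) const {
        if (ray.dir[2] == 0.0f) return false;
        const float t = (height - ray.origin[2]) / ray.dir[2];
        if (t < ray.tmin || t > ray.tmax) return false;
        ray.tmax = t;
        return true;
    }
};

// Returns the index of the closest primitive, or -1 on a miss.
// The ray is taken by value, so every caller (thread) mutates its own copy.
inline int closest_hit(Ray ray, const std::vector<PlaneZ>& prims) {
    int closest = -1;
    for (std::size_t i = 0; i < prims.size(); ++i) {
        if (prims[i].intersect(ray)) closest = static_cast<int>(i);
    }
    return closest;
}
```

Since the only mutated state is the ray itself, giving each thread its own ray is enough to make concurrent calls safe in this sketch.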
close enough 😎
mayhaps duing tlasversar
why would you have the same ray traversing the scene multiple times in parallel
ya got a point
I wonder if loading in suzanne would be a good idea
Damn, 10ms
Not bad
Now I'll load intel sponza 
hmm how many tris are in suzanne?
if intel sponza has 1000x as many tris, I expect only about a 10x decrease in perf (assuming bvh2)
oh nice only 5x decrease
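One way to read that estimate: in a binary BVH the number of levels a ray walks grows roughly with log2 of the triangle count, so 1000x more triangles adds about log2(1000) ≈ 10 levels instead of multiplying the work by 1000. A tiny sketch of the arithmetic, assuming a perfectly balanced tree with one primitive per leaf (real trees are not, and cache behavior matters too):

```cpp
#include <cassert>
#include <cstdint>

// Depth of a perfectly balanced binary tree with `leaves` leaves: ceil(log2(leaves)).
inline std::uint32_t balanced_bvh2_depth(std::uint64_t leaves) {
    std::uint32_t depth = 0;
    std::uint64_t capacity = 1;  // leaves a tree of the current depth can hold
    while (capacity < leaves) {
        capacity *= 2;
        ++depth;
    }
    return depth;
}
```

For example, going from 1,000 to 1,000,000 leaves takes the depth from 10 to 20.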
quite nice tbh
not bad
I'll go back to cornell box though 
this is cpu too
ye fully CPU
```cpp
// Each task handles a chunk of rows; every pixel builds and traces its own ray.
executor.for_each(0, height, [&](size_t start, size_t end) {
    for (auto y = start; y < end; ++y) {
        for (auto x = 0u; x < width; ++x) {
            auto color = glm::vec4(0.0f);
            for (auto s = 0u; s < spp; ++s) {
                // Pixel -> NDC, then unproject two depths to get a world space ray.
                const auto u = x / static_cast<float>(width - 1);
                const auto v = y / static_cast<float>(height - 1);
                const auto uv_near = glm::vec4(glm::vec2(u, v) * 2.0f - 1.0f, 0.0f, 1.0f);
                const auto uv_far = glm::vec4(glm::vec2(u, v) * 2.0f - 1.0f, 0.1f, 1.0f);
                auto world_near = inv_pv * uv_near;
                auto world_far = inv_pv * uv_far;
                world_near /= world_near.w;
                world_far /= world_far.w;
                auto ray = bvh_ray(
                    as_vec3(world_near),
                    as_vec3(glm::normalize(world_far - world_near)));
                auto bary = glm::vec3(0.0f);
                auto hit = intersect(bvh, ray, [&](size_t i) {
                    if (auto prim_hit = perm_prims[i].intersect(ray)) {
                        const auto& [b_u, b_v] = *prim_hit;
                        bary = glm::vec3(b_u, b_v, 1.0f - b_u - b_v);
                        return true;
                    }
                    return false;
                });
                if (hit != -1) {
                    // Accumulate so that dividing by spp below yields the average.
                    color += glm::vec4(bary, 1.0f);
                }
            }
            color /= spp;
            image[y * width + x] = encode_rgba(glm::vec4(as_srgb(color), 1.0f));
        }
    }
});
```
Amazing
now do path tracing
soon™️
is this a parallel executor or something
ye
if it's 43ms fully CPU you could totally have it go realtime on your GPU
oke that's ebic
BVHv2's
I still gotta understand how to traverse the BVH on the GPU
but I'm slowly beginning to expand my brain mass
literally the exact same as traversing on the cpu in this case
because madmann is using the SmallStack thingy with a fixed size
what does that mean
as in, instead of
```cpp
while (!stack.is_empty()) {
    // traverse
}
```
each thread has its own ray and does its own traversal and intersection
You do something fancier
and each thread has its own stack
Fair enough
the shader compiler generates a traversal kernel
the only thing the hardware accelerates on AMD is bvh node and triangle intersection
the actual traversal is just regular code
Very nice
using a "stackless" (actually fixed size stack) method
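A CPU-side sketch of that per-thread, fixed-size-stack traversal. It is a simplified illustration, not madmann's or AMD's actual code: `Node`, `Ray`, `hits_box` and `traverse` are made-up names, the node layout is assumed, and the stack depth of 64 is an arbitrary choice:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Node {
    float lo[3], hi[3];
    std::uint32_t first;  // inner: left child index (right = first + 1); leaf: first primitive
    std::uint32_t count;  // 0 for inner nodes
};

struct Ray { float origin[3], inv_dir[3], tmin, tmax; };

// Slab test: does the ray overlap the box within [tmin, tmax]?
inline bool hits_box(const Ray& r, const Node& n) {
    float t0 = r.tmin, t1 = r.tmax;
    for (int a = 0; a < 3; ++a) {
        float t_near = (n.lo[a] - r.origin[a]) * r.inv_dir[a];
        float t_far  = (n.hi[a] - r.origin[a]) * r.inv_dir[a];
        if (t_near > t_far) std::swap(t_near, t_far);
        t0 = std::max(t0, t_near);
        t1 = std::min(t1, t_far);
    }
    return t0 <= t1;
}

// Calls on_primitive(index) for every primitive in leaves the ray touches.
// The stack is a local array, so each thread (or GPU lane) owns its own.
template <typename OnPrimitive>
void traverse(const std::vector<Node>& nodes, const Ray& ray, OnPrimitive&& on_primitive) {
    constexpr std::size_t max_depth = 64;
    std::array<std::uint32_t, max_depth> stack{};
    std::size_t top = 0;
    assert(!nodes.empty());
    stack[top++] = 0;  // root
    while (top != 0) {
        const Node& node = nodes[stack[--top]];
        if (!hits_box(ray, node)) continue;
        if (node.count != 0) {
            for (std::uint32_t i = 0; i < node.count; ++i) on_primitive(node.first + i);
        } else {
            assert(top + 2 <= max_depth);  // fixed capacity, no heap growth
            stack[top++] = node.first;
            stack[top++] = node.first + 1;
        }
    }
}
```

The loop body uses nothing that a compute shader lacks, which is why the same structure ports to the GPU with one ray and one stack per invocation.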
I guess traversal is inherently difficult to parallelize
I mean, where would you even begin
Each step depends on the previous
stop thinking about parallelizing traversal 
each thread has its own ray to worry about
unless you have a scene with 10^100 triangles and literally only a single ray, parallelizing traversal doesn't seem very helpful
Perhaps work expansion could help
each work package does the traversal for its own level
and dispatches more work for the next level
until leaves are reached
ok I'll stop now

actually that does seem kinda interesting for making memory access more coherent
but you'd have one dispatch per level of the bvh, and each dispatch would become increasingly incoherent as there are more nodes
prolly not worth tbhbh
I mean if the big brains at NV and AMD are doing it this way then it's not worth thinking about just yet
anyways
I'm quite happy I managed to understand BVHs this quickly
I was expecting a more gruesome and bloody thing
what you're proposing sounds like distributing the work of a single ray's traversal to several threads. but you're already gonna have millions of rays, so you can shrimply have each thread compute one ray
Alright boys
poll time
Do I first shade this on the CPU or do I immediately start writing a shader
for learning purposes it's probably easier to start with the CPU
and you also don't have to use a shit (shading) lang while you're learning
How much do I have to pay you for you to make a good shading language btw
deal
(approx)
just write your own shading lang that will fix what's wrong with GLSL for real this time
I have no idea how to write languages
call it glsl 2.0
too bad shading languages still require some knowledge of graphics APIs
e.g., you still need a concept of resource binding
I know what that is fortunately 
if you're using cutting-edge vulkan, at least you can use BDA and descriptor indexing to shrimplify that stuff a bit
but I doubt you can make anything cuda-like without also providing your own API that wraps stuff nicely
I don't want cuda like tbh
I want to learn how to do RT in glsl
so I can use it in Iris 
And frogfood as well
I'm just saying that shading languages suck and cuda is a much nicer environment to use
ye true
and I'd like to be able to have something similar for graphics
you are at AMD
without vendor lock-in 
just pester some graphics engineer or something
you can use advanced tactics like:
guns
guns
more guns
intercontinental ballistic missiles (in case they escape)
the tf2 mercenary approach to persuasion
it do be
