#Rosy
bjorn switching to linux fulltime, you read it here first
anyway for memory management I'd prefer to stick to at least cudaMallocManaged as "default"
that absolves you from the need to do cudaMemcpy and such
it can be a perf footgun in a way because something might be sticking around in host memory while being very heavily hammered by the device but there's cudaMemAdvise to migrate it around
I still have to read the cuda programming guide
I've been focused on the optix side of things
it's got a full chapter on unified memory
given my current progress, unless there's some massive hidden whopper that I cannot yet see that's going to hit me right in the face with OptiX, there's likely no chance I'm going to go back to try to do graphics again with Vulkan. It's like 10x less pain with CUDA and OptiX. I made the right choice.
truly joever
right for right now anyway
ya
idc it's fun
going to uh try to make this suzanne appear now I guess
this is the only thing I'm unsure about when it comes to submodules
it seems like a pain to specify a specific commit
forking and tagging them yourself feels nicer, if you dont want to rely on the upstream repo
or you git gud
I just did:
cd external\fastgltf
git checkout v0.9.5
cd ..\..
git add -i
...
git commit
works?
you mean in cmake?
I guess if I wanted that in cmake I'd use FetchContent_Declare
but anyone cloning the repo and checking out a commit would get the right versions of your git submodule dependencies (after a git submodule update), since the pinned submodule commit is recorded in the parent repo alongside .gitmodules
include(FetchContent)
FetchContent_Declare(
  fastgltf
  GIT_REPOSITORY https://github.com/spnda/fastgltf.git
  GIT_TAG v0.9.0
)
message("Fetching fastgltf")
set(FASTGLTF_USE_CUSTOM_SMALLVECTOR OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_TESTS OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_EXAMPLES OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_DOCS OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_GLTF_RS OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_ASSIMP OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_DEPRECATED_EXT ON CACHE BOOL "" FORCE)
set(FASTGLTF_DISABLE_CUSTOM_MEMORY_POOL OFF CACHE BOOL "" FORCE)
set(FASTGLTF_USE_64BIT_FLOAT OFF CACHE BOOL "" FORCE)
set(FASTGLTF_COMPILE_AS_CPP20 OFF CACHE BOOL "" FORCE)
set(FASTGLTF_ENABLE_CPP_MODULES OFF CACHE BOOL "" FORCE)
set(FASTGLTF_USE_STD_MODULE OFF CACHE BOOL "" FORCE)
FetchContent_MakeAvailable(fastgltf)
and then
target_link_libraries(YourProject
PRIVATE fastgltf
)
then you can #include <fastgltf/fastgltf.hpp>
or with cpm
# this file would sit in libs/fastgltf.cmake
include(../cmake/CPM.cmake)
CPMAddPackage(
NAME fastgltf
GIT_REPOSITORY https://github.com/spnda/fastgltf.git
GIT_TAG 2483b87d19a67c91f2bffc386082001d3ea8bc07
OPTIONS "FASTGLTF_USE_CUSTOM_SMALLVECTOR OFF"
OPTIONS "FASTGLTF_ENABLE_TESTS OFF"
OPTIONS "FASTGLTF_ENABLE_EXAMPLES OFF"
OPTIONS "FASTGLTF_ENABLE_DOCS OFF"
OPTIONS "FASTGLTF_ENABLE_GLTF_RS OFF"
OPTIONS "FASTGLTF_ENABLE_ASSIMP OFF"
OPTIONS "FASTGLTF_ENABLE_DEPRECATED_EXT ON"
OPTIONS "FASTGLTF_DISABLE_CUSTOM_MEMORY_POOL OFF"
OPTIONS "FASTGLTF_USE_64BIT_FLOAT OFF"
OPTIONS "FASTGLTF_COMPILE_AS_CPP20 OFF"
OPTIONS "FASTGLTF_ENABLE_CPP_MODULES OFF"
OPTIONS "FASTGLTF_USE_STD_MODULE OFF"
)
CPM can install itself with a 2 liner
ok looking at the OptiX_Apps repo, I like how they do it: GPU pointers to the vertex data, the indices, the material and the light go on the SBT record for a shader. You get the primitive index (the index of the triangle hit), and with it you look up the vertex indices of that triangle via the indices buffer pointer you get in your closest hit shader. That gives you a way to look up whatever you stored per vertex (position, normal, uv, color etc) in GPU memory, given a defined stride, and you can also use the triangle index to look up your material
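A rough host-side C++ sketch of that lookup, just to make the indexing concrete. It is not OptiX code: `HitGroupData`, `Vertex`, `load_vertex` and `fetch_triangle` are made-up names, the pointers would be device pointers in a real SBT record, and `primitive_index` stands in for what a closest hit program gets from `optixGetPrimitiveIndex()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical per-vertex layout; the real one is whatever was packed into the buffer.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};

// Hypothetical stand-in for the data part of a hit group SBT record.
struct HitGroupData {
    const unsigned char *vertices;      // raw vertex bytes
    std::size_t          vertex_stride; // bytes between consecutive vertices
    const std::uint32_t *indices;       // 3 vertex indices per triangle
    const std::uint32_t *material_ids;  // 1 material index per triangle
};

// Fetch one vertex by index, honoring the stride.
inline Vertex load_vertex(const HitGroupData &d, std::uint32_t vertex_index) {
    assert(d.vertex_stride >= sizeof(Vertex));
    Vertex v;
    std::memcpy(&v, d.vertices + vertex_index * d.vertex_stride, sizeof(Vertex));
    return v;
}

// Triangle index -> three vertex indices -> three vertices, plus the material.
inline void fetch_triangle(const HitGroupData &d, std::uint32_t primitive_index,
                           Vertex out[3], std::uint32_t &material_id) {
    for (int corner = 0; corner < 3; ++corner) {
        std::uint32_t vi = d.indices[primitive_index * 3 + corner];
        out[corner] = load_vertex(d, vi);
    }
    material_id = d.material_ids[primitive_index];
}
```

From there the three vertices would normally be blended with the hit's barycentrics to get the shading normal, uv and color.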
is this what you do to add a submodule?
the per instance stuff I'll stick at the beginning of the gas vertex memory
well I do a git submodule add & git checkout tag etc but yea
I mean I do that very rarely
so not a big deal in my opinion
hmm I see
the only way I use submodules is by right clicking and going to tortoisegit then clicking "add submodule" 
using ui for git is a crime
tortoisegit is pretty good
I only use the cli with git
I use both but it's really easy to fuck up complicated workflows like rebase interactive with cli
but then you get dumped into a text editor (probably vim)
Last time I tried rebasing was at an internship and I fucked it up lol
a git merge is also a crime
Yeah rebase is a better workflow imo
I work on a team with like 30+ engineers in a monorepo and we almost never fuck it up
or rebase then merge
everyone working in branches together and we do ok
Hey I mean whatever works for you
yes
but I like my UIs
my git history is long and linear
last time there was a branch was when saky was adding his sky stuff to it
oh, well I work in branches
I just merge them and delete them when I am done with it
I can't live without branches
also I get AI reviews that save my life
At work we all do branches, but for ASO I only do a branch if I'm gonna really make a huge breaking change, so I can easily get back to the working version
yeah I will branch off in my branch also, to try experimental stuff
like when I'm not sure something will work
I feel dirty pushing to main, I don't want to get in the habit of doing it, I'll accidentally do it at work
I am an org admin so none of the branch protection rules apply to me
so I can accidentally do it
For solo dev I feel no need to be overly rigorous about my workflow, but on a team, yeah it makes sense to be careful
Interesting, almost all places I've worked use a "rebase before merging" workflow to preserve linear history
merge commits are horrible
you want to revert a PR, and it has a merge commit in it you are in big trouble
if this is in an incident, it's going to increase the time of impact
I'm like at the point now where I pretty much know exactly what I need to do, I just have to do the work.
I will break it up into tiny small surgical code changes, and verify it along the way. I don't like doing a big bulk of work with tons of changes in them that could be hell to debug if there's UB or something in there
so I think the first thing is to just capture the gpu memory pointer to the vertices data and stop freeing it after making the gas
then stick the pointer to it on the SBT record for the ch shader
then add the normals to it, and get the normals from there instead of from the beginning of the buffer
then add vertex colors
and then once I have this all setup, I can extract fastgltf data to GPU memory in the same way and it should just work
if I get this working today, maybe tomorrow I can start on textures, I have no idea how to do textures yet
or race track, not sure
probably race track, eager to get it back, I can't actually use it though as my controller is at home
light? sbt record? guh
I don't actually know why it's there, I just saw it was, maybe it means emissive?
I guess
believe it or not, that was 10 years ago
I have vertex colors now and per-vertex normals
ready for some gltf loading I think
I didn't pick the best colors for my cube mesh
time flies too fast, we need more gravity
i still have no shadows : >
you have mipmaps though
man RT looks great
even with such a tiny model
I actually don't have any lighting math. not even n dot l
I just did shadows
if occluded color *= 0.5f
it's pretty cool
I probably am like a month ahead of where I would have been had I done it in vulkan instead I think
who knows
yeah rt is great, you can basically get full gi, hard+soft shadows and perfect specular or rough reflections in 100 lines of code
making it fast otoh,,,
Walking around Bremerton today
Found a nice hobby store, they have Critical Role stuff 
Too big to bring back with me though
This place is busy
hrm
I think OptiX supports mipmaps through cuda texture objects already
I don't think I have to like write my own solution
well
I'm not sure, I see:
optixTexFootprint2D
optixTexFootprint2DGrad
optixTexFootprint2DLod
which take cuda texture objects and have mipmap parameters, I don't really understand how they work
the optixWhitted example in the SDK samples a texture
it's all via cuda
I wonder what optixTexFootprint2D helps out with
For that you would usually implement tracking of differential rays inside the ray tracer which would allow calculating the current derivatives at a hit point which in turn lets the resp. texture grad function select the best mipmap LODs.
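To make "derivatives select the mipmap LOD" a bit more concrete, here is a small sketch of the usual isotropic textbook formula: scale the UV derivatives to texel units, take the longer footprint axis, and take log2 of it. This is an assumption about the general idea only, not what tex2DGrad or the hardware does exactly (anisotropic filtering is more involved), and `mip_lod` is a made-up helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Pick a mip level from the UV footprint of one pixel.
// (du_dx, dv_dx): change in (u, v) when moving one pixel in screen x.
// (du_dy, dv_dy): change in (u, v) when moving one pixel in screen y.
// These are the values ray differentials would provide at the hit point.
inline float mip_lod(float du_dx, float dv_dx, float du_dy, float dv_dy,
                     int tex_width, int tex_height, int num_levels) {
    assert(num_levels > 0);
    // Convert the derivatives from UV units to texel units.
    float sx = du_dx * tex_width, tx = dv_dx * tex_height;
    float sy = du_dy * tex_width, ty = dv_dy * tex_height;
    float len_x = std::sqrt(sx * sx + tx * tx);
    float len_y = std::sqrt(sy * sy + ty * ty);
    // The longer axis of the footprint decides how blurry the lookup must be.
    float rho = std::max(len_x, len_y);
    if (rho <= 1.0f) return 0.0f;  // magnification: stay on the base level
    return std::min(std::log2(rho), float(num_levels - 1));
}
```

A pixel that covers 4 texels along its longer axis lands on level 2, since each level halves the resolution.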
womp womp
yeah I guess that didn't make any sense to me anyway
yeah I'm going to figure out textures
fuck it
we work on what we want to work on at any given moment 
we'll do it live
ok so Suzanne has a texture
so my first thing I'll try is to just create a cuda texture object without it creating an error
so the first actual thing is to just load the image
I'll try and use SDL for that
since I already have it as a dependency
i spent a lot of time today reading the optix programming guide, I have a much better understanding of shader binding tables now
I think it's maybe the best explanation of an SBT I have read so far
now that I understand how the records work and how they're associated with geometry and instances it makes a lot of sense
it's just a way to index into a table to get data, and the acceleration structure on a hit returns the sbt-index which I use to get the data that was configured by me in the sbt record
in my mind an sbt record works similar to a push constant, except instead of me supplying it per draw, I set up the shader program's sbt record with params, and the acceleration structure gives me the sbt-index to that record plus the index of the triangle I hit. Then I take that sbt index to get the sbt record and look up all my data in a buffer, just like I would use the data in a push constant to look up all my data in a buffer for my slang shader
sdl3 can load PNGs now, don't even need to add SDL_image
i think they are just wrapping stb_image though. SDL_image uses the real libpng
hrm
well suzanne has a png so I will use SDL3 since I already have it
didn't realize SDL_Image is a separate library
image decoders can get pretty big, i think they keep it separate so as to not make the main lib huge
and to keep it free of external deps
it's already huge
and that's after I went in and trimmed out stuff from the decoder libraries. it was like 400mb before that
SDL repos are absolute units
I want to load exr
so I will eventually get SDL_image
but not right now
sdl_image doesn't load exr
I use tinyexr. I tried getting the real OpenEXR SDK building but its dependency math lib requires RTTI to compile. the offline world is wild
gross
I remember reading about the math library for openexr when I was investigating how to write my own exr lib
how is that wild
it got pulled out of openexr into its own project if I recall
I actually got it to build
why does a math lib need RTTI?
well disabling language features is on you
it was a bit involved to get that thing to build if I recall, it was a while ago, and it was related to the math library
I don't remember what the specific challenge was
I didn't disable any language features
once I had the cli tool it was really cool, it has a lot of features
you can change the compression, even remove compression
oh I was responding to ravbug
hrm
I don't see SDL_LoadPNG
https://wiki.libsdl.org/SDL3/SDL_LoadPNG says it's in SDL_Surface.h
This function is available since SDL 3.4.0.
wonder what version is shipped in the vulkan sdk
GitHub Repo: sdl3, Version Tag: release-3.2.26
womp womp
hrmmm
I'll just use SDL_image, I don't want to change my SDL
let's see how hard that is
3.4 is not even released yet
the latest pre-release is 3.3.4
you could also use stb image
C:\Users\swart\projects\pixel_storm>git submodule update --init --recursive
Submodule 'external/aom' (https://github.com/libsdl-org/aom.git) registered for path 'external/SDL_image/external/aom'
Submodule 'external/dav1d' (https://github.com/libsdl-org/dav1d.git) registered for path 'external/SDL_image/external/dav1d'
Submodule 'external/jpeg' (https://github.com/libsdl-org/jpeg.git) registered for path 'external/SDL_image/external/jpeg'
Submodule 'external/libavif' (https://github.com/libsdl-org/libavif.git) registered for path 'external/SDL_image/external/libavif'
Submodule 'external/libjxl' (https://github.com/libsdl-org/libjxl.git) registered for path 'external/SDL_image/external/libjxl'
Submodule 'external/libpng' (https://github.com/libsdl-org/libpng.git) registered for path 'external/SDL_image/external/libpng'
Submodule 'external/libtiff' (https://github.com/libsdl-org/libtiff.git) registered for path 'external/SDL_image/external/libtiff'
Submodule 'external/libwebp' (https://github.com/libsdl-org/libwebp.git) registered for path 'external/SDL_image/external/libwebp'
Submodule 'external/zlib' (https://github.com/libsdl-org/zlib.git) registered for path 'external/SDL_image/external/zlib'
lol
it's git submodules all the way down
I should have, but I would be a quitter if I gave up now
well idk what sdl's png library's trade-offs are, so I can't judge
probably not worth it whatever they are
yeah being an unfamiliar library is a disadvantage in itself
I switched from vulkan sdl to git submodule anyway
forat 😅
that dependency change dramatically increased my compile time for that file
it's like 8 seconds
brutal
I'm going to put that in its own TU tbh
excited about having textures finally again
this project is cool
tomorrow I'll try and actually make a cuda texture object from this data
I kind of like this push constant mental model with regards to understanding SBTs. I want to refine it a bit:
SBTs are arrays of data, similar to the data supplied per draw in a graphics pipeline, including what shaders to use and what materials are needed.
The difference is that with ray tracing you have to supply all that data at once, so it's put into a table that is constructed in such a way that a ray trace knows how to get the index into that array via the acceleration structure upon a hit.
graphics pipeline -> supply context to render via binding/push constants/uniforms for each draw
ray trace -> supply context to render via indexes into arrays of data in a table for each ray trace
i was building openexr just last night, it was a little kick in the nuts i have to admit
SBT entry in my head is function pointer + associated data sitting next to it
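That "function pointer + associated data" picture can be sketched in plain C++. This is a toy model only: `SbtRecord`, `HitInfo`, `on_hit` and the two shade functions are made-up names, and a real OptiX record holds an opaque program header rather than a C function pointer.

```cpp
#include <cassert>
#include <vector>

struct HitInfo { unsigned primitive_index; };

// Toy record: "which program to run" and "the data it runs with", side by side.
struct SbtRecord {
    float (*closest_hit)(const void *data, const HitInfo &hit);
    const void *data;
};

struct FlatColor   { float value; };
struct PerTriangle { const float *values; };

inline float shade_flat(const void *data, const HitInfo &) {
    return static_cast<const FlatColor *>(data)->value;
}

inline float shade_per_triangle(const void *data, const HitInfo &hit) {
    return static_cast<const PerTriangle *>(data)->values[hit.primitive_index];
}

// What traversal conceptually does on a hit: it produces an SBT index and a
// primitive index, then the matching record's program runs with that record's data.
inline float on_hit(const std::vector<SbtRecord> &sbt, unsigned sbt_index,
                    const HitInfo &hit) {
    assert(sbt_index < sbt.size());
    const SbtRecord &rec = sbt[sbt_index];
    return rec.closest_hit(rec.data, hit);
}
```

The whole table is built up front, which matches the "supply all that data at once" point above: nothing is bound per draw, the hit itself picks the record.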
Oh it is yeah
Nice, I went from being confused a couple of days ago to feeling pretty good about SBTs
Now I want to clear up cuda texture objects and ray differentials
At least just for primary rays. I am not building a path tracer yet
it looks like textures and surfaces are considered legacy in CUDA 13.1
they removed almost all mention of them from the new programming guide and are pointing people with "existing code bases" to the now legacy version of the programming guide
so like functions like tex2D are now only in the legacy guide
I asked on the optix forum about the status, since idk what someone using optix is supposed to use now, other than legacy functionality you have to go read about in a doc that's not being updated anymore
I’m learning how to use OptiX and I wanted to sample textures in my renderer. The CUDA 13.1 Programming Guide states: Texture and Surface memory are not discussed further in this guide as there is no advantage to using them in CUDA on any currently supported NVIDIA GPU. CUDA developers should feel free to ignore these APIs. For developers...
well they still exist so I will use them
so what do they sell now? pixelbuffers? 
ah even "lower level" so to speak
well the uvs map to some quad, and I randomly sample from that region
it's a raytracing engine after all
but then I'm copying uncompressed images to the device?
perhaps optix/cuda has extensions for compressed formats?
or supports that natively somehow
or you have to go with vulkan for that part?
cuda texture objects support block compressed formats
but the documentation for working with textures is in the legacy docs
then the material has to come from ray intersections only
no textures, just material properties
I'm just going to use the stuff the way the sdk does 
Maybe i can send the block compressed image to the device and since the shader is just C++ just use a compression library in the shader
Or maybe the NVIDIA texture tools sdk has something in it for device side
When passing device pointers to NVTT, make sure the pointer refers to memory NVTT's device (and runtime API context if using the CUDA Driver API) can access. Similarly, when accessing data from device pointers returned from NVTT, make sure the current device (and context, if using the CUDA Driver API) can access allocations made by NVTT's device using the CUDA Runtime API. Device pointers are returned from nvtt::Surface::gpuData(), and used in the low-level GPU compression function API in nvtt_lowlevel.h.
seems like I can
// Shows how to use NVTT 3's low-level GPUInputBuffer API to compress a texture
// directly from a CUDA buffer. Using this API allows the input and output to
// exist on the GPU, avoiding GPU-to-CPU and CPU-to-GPU copies.
it just uses a cuda texture object lol
I suspect the response I'm going to get to my question is to continue to use tex2D and texture objects, despite it being "legacy". All of their sample code across three different NVIDIA platforms use cuda texture objects.
It is recommended programs use cudaMallocHost() to allocate buffers which
will be used to send or receive data from GPUs.
oh
Allocates page-locked memory on the host
Page-locked memory will not communicate with hard drive. Therefore, the efficiency of reading and writing in page-locked memory is more guaranteed.
Host to GPU copies are much faster when they originate from pinned (page-locked) memory.
I want mipmaps, but I don't think Suzanne's model has a good texture to see if mipmaps look good. I think the chess board ABeautifulGame gltf is better for that, but that's a whole scene. I think I will work on loading a proper scene next, using ABeautifulGame, and then work on mipmaps.
I think at this point though I could load my track and vehicle at any point and I could be working on my game, but I want to figure out scene material things first. Once I get tired of rendering stuff I can go back to my game.
the council will decide your fate
Does The Combine know that headcrabs are taking their orders from Suzanne?
I'm going to pitch valve on supporting this project
Also:
(Yes, after warming up the kids with a Half-Life 2 reference I did pull out a Transformers: The Movie reference 👴 )
maybe I use deccer cubes to work on scene loading, and then abeautifulgame for mipmaps
for mipmaps you can make a tube scene
where the camera is placed inside a tube, facing along it
basically
or make a tube game
hrmm a tube game
there was a game I played a few times at a friend's very long time ago as a smol kid
that I'm reluctant to name because the company behind it is cringe now
"lagsters"
it was basically racing but within a tube track
so you didn't have to turn but you had to like
position yourself within the tube to avoid obstacles
and pick up power ups
looking down a tube is going to be very anisotropic
could also be a use case for curve primitives
I like the idea, thank you
temple run
this game was in a web browser in 2001 or something. https://www.youtube.com/watch?v=urIqLheLwtU
Here's the full version
http://lostgames00.blogspot.com/2016/02/here-are-list-of-wildtangent-games-that.html
To play the Full version:
-
first create both a file "Shockwave.com"1st and "Betty Bad"2nd on Program File before installing,
-
then Browse the destination folder to Program Files.
For Example:
C:\Program Files\Shockwave.com\B...
feel like we are just getting scammed non-stop
actually, I should probably cancel all projects and do a remake of this 🤔
that's fun looking
hrmmm
a tube game is a series of tubes
I'd need to make some textures for the tubes
what
I thought it was a joke
big GPU is always trying to sell us more GPUs when we're having fun with what we had already
but no, I want more GPUs in reality
with more stuff
real
With the largest advancement since the NVIDIA CUDA platform was invented in 2006, CUDA 13.1 is launching NVIDIA CUDA Tile. This exciting innovation introduces a new virtual instruction set for tile-based parallel programming, focusing on the ability to write algorithms at a higher level and abstract away the details of specialized hardware, such...
Pretty interesting
cuTile lets you do array operations where tiles are sub-arrays: opaque objects that are cheap to copy and unlock better compiler optimizations
It’s funny when NVIDIA refers to portability they mean portable across NVIDIA architectures
This isn’t available via C++ yet, just python
C++ython
wait until it pivots to Modern OpenGL tutorials (in python)
I am pretty CUDA pilled right now. It would be a bitter pill to swallow to go back to vulkan
I do miss validation
there is validation for optix, I haven't turned it on yet
I should actually
options.validationMode = OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL;

[OptiX][2][VALIDATION_ERROR]: [TRACE_DEPTH_EXCEEDED] Trace depth exceeded.
launch index: [891, 367, 0]
additional occurrences: 9407
The maximum optixTrace recursion depth is exceeded. See OptixPipelineLinkOptions::maxTraceDepth.
[OptiX][2][ERROR]: Error syncing stream (CUDA error string: unspecified launch failure, CUDA error code: 719)
Error recording resource event on user stream (CUDA error string: unspecified launch failure, CUDA error code: 719)
Error recording resource event on user stream (CUDA error string: unspecified launch failure, CUDA error code: 719)
Error launching work to RTX
Error recording resource event on user stream (CUDA error string: unspecified launch failure, CUDA error code: 719)
OptiX error: Error during validation mode run (code 7053) at C:\Users\swart\projects\pixel_storm\src\ps_game.cu:821
I don't recurse though 
I have misconfigured something
The link options consist of the maximum recursion depth setting for recursive ray tracing, along with pipeline level settings for debugging. However, the value for the maximum recursion depth has an upper limit that overrides a limit set by the link options.
ohh
I bet it's my shadows
OptixPipelineLinkOptions link_options = {};
link_options.maxTraceDepth = 1;
when cluster AS
yup that fixed it
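A toy model of why shadows can trip that check, assuming the shadow ray is traced from inside the closest hit program (if it is traced from raygen instead, this does not apply). The trace call in ray generation counts as depth 1, and a trace issued from a closest hit program is depth 2. None of this is OptiX code; `TraceModel` and the functions are invented for illustration.

```cpp
#include <cassert>

// Toy model of trace recursion depth.
struct TraceModel {
    int  max_trace_depth;  // plays the role of maxTraceDepth
    bool exceeded;         // set when a trace call goes past the limit
};

// A shadow ray only reports hit or miss and traces nothing further.
inline void trace_shadow(TraceModel &m, int depth) {
    if (depth > m.max_trace_depth) m.exceeded = true;
}

// A primary ray whose closest hit program fires a shadow ray.
inline void trace_primary(TraceModel &m, int depth) {
    if (depth > m.max_trace_depth) { m.exceeded = true; return; }
    trace_shadow(m, depth + 1);  // traced from inside closest hit
}

// Ray generation traces the primary ray at depth 1.
inline bool launch_exceeds_depth(int max_trace_depth) {
    assert(max_trace_depth >= 0);
    TraceModel m{max_trace_depth, false};
    trace_primary(m, 1);
    return m.exceeded;
}
```

In this model `launch_exceeds_depth(1)` is true and `launch_exceeds_depth(2)` is false, so primary ray plus shadow-from-hit wants a depth of 2 even though nothing looks recursive.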
soon
im looking forward to the blog post on our blog explaining all that optix shit
bjorn in for some suffering
I live to suffer
I don't think I'm a credible authority on optix, I have been using it for like 2 minutes, but I'm always happy to share how I do things, and if I do, I hope someone tears into it to tell me about all of the millions of things I'm doing incorrectly
that wont happen, nano will make sure you are on the right track 🙂
vk rt transfers pretty straightforwardly to optix, and back (vk has all the features optix has either in form of khr or nv exts)
I haven't used any nv ones (I don't have the hw) and my optix knowledge comes from zooming around optix programming guide and cycles' source but I think I'm pretty comfortable looking at optix code
yes, it seems very similar to vulkan conceptually
I saw there's now a reordering extension with the latest vk sdk
I think maybe people interested in vulkan ray tracing might benefit from reading the optix programming guide
because it's very good I think
well these days these features come to those other apis first and to optix second
optix docs are pretty bad tbh
like even look at optixAccelBuild
oh I read the guide for a high level understanding, then look at the sample code for how to use it
nothing ever tells you the instance array takes a pointer to an array of optixInstance
I am constantly switching back and forth between pdfs, and notepad++ to figure stuff out
does optix get the feature second? it has shader execution reordering and cluster AS
I thought it would get it first
what's an example of a feature that vulkan/dx got first?
the resource allocation for the vulkan/dx drivers is probably larger than for optix
so I guess that makes sense
vk got khr ser recently but it had nv ser for a few years already
nv-specific cluster AS also for like a year or more
oh
so the vulkan nv version comes out before optix
yes that only came out recently on optix
there's also khr position fetch
I think optix had pos fetch in 8 so it might w/eburger
ignore that one
I'm p sure what it's saying is that when you do copies from/to random host pointer there's an additional host copy from the pointer you passed to device-accessible mapping, or that it's host importing that region
while page-locked means it has been mapped to the device earlier
and cudaMallocHost can also be swapped to disk you know
it's just that because demand paging is generally non-existent for non-cpu-threads (current exception being linux with certain compute apis) faulting happens at submission boundaries
and the memory stays basically mlock'ed for the duration of a submission
anyway I'd rather not think about cudaMalloc this cudaMallocHost that but just use cudaMallocManaged
it's basically cudaMalloc{,Host} in one but you can cudaMemAdvise later to migrate it between host and device
also removes the need for some cudaMemcpies like I said
it's basically the nicest thing short of fully functioning unified shared memory that you get on cuda linux
where you can just pass any random pointer to the device and it just werks, no need for cudaMalloc* of any kind
also how are texture objects legacy?
the new 13.1 CUDA programming guide removed all mention of texture fetch, and says to go look in the legacy programming guide
where it is still documented
it also says texture memory offers no benefits
regarding cudaMallocManaged I don't know, it seems like I know what's happening if I explicitly cudaMemcpy. I should try it
and see in nsight what it does
to build trust
cudaMallocManaged is basically just cudaMalloc{,Host} that can be migrated between host and device
it doesn't magically do things on its own except on first use
going to try it
it says that in a subsection of "device memory spaces"
because yes there's no "texture memory" (hasn't been in a while, like 14 years)
it doesn't say texture/surface objects are deprecated though
reading the cudaMallocManaged huge doc comment
no it doesn't say deprecated
it says legacy and should no longer be used for texture memory
and if you want to know how to make sense out of texture fetching from the CUDA/Optix example code you somehow have to figure out to look at the legacy programming guide
Introduced the new CUDA Programming Guide, the official, comprehensive resource on the CUDA programming model. The guide has been restructured into five parts that cover a language agnostic overview of CUDA, introductory and advanced CUDA programming in C++ and Python, detailed descriptions of specific CUDA features, and technical appendices for reference. The legacy CUDA C++ Programming Guide remains available in this release but is deprecated and includes notices that direct readers to the new guide.
but it'd be pretty weird because there's no replacement for these apis for cases when you want to sample an image with filtering
The Release Notes for the CUDA Toolkit.
The legacy CUDA C++ Programming Guide remains available in this release but is deprecated and includes notices that direct readers to the new guide.
I'm compelled to just write it off as docs being a meme
^
in any case there are reasons to give up hw sampling sure
at which point you can also kinda give up hw doing format conversion
e.g. if you one day were to decide to implement neural textures you wouldn't be using hw sampling
though you'd still benefit from using surface apis to write out results of your renderer
how do you give up on hw sampling? can you do sampling with block compression image formats?
or just don't use compression?
you nih your own conversion as seen in neural textures
why is that better
well for a start it lets you not use hw formats but nih your own which might be better
neural textures are built on observation that you don't just compress like 3 or 4 channels at once but like 15
(the texture sets, like beside rgb base color you also have normal map, microfacet roughness, blend between metallic and dielectric, etc)
ohh
those channels have a lot of correlation
so nihing your own format lets you exploit that correlation and potentially win over hw BCn etc compression
a prerequisite to giving up on hw sampling is usually switching to stochastic filtering
(see filtering after shading)
if you're migrating away from hw sampling you usually do that step first
that entails replacing your tex2DGrads with texel fetches of randomly chosen texels
what were weights in the weighted sum of aniso box filter become PMFs of texels so it's pretty straightforward in that regard
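A minimal CPU sketch of "weights become PMFs". It uses plain bilinear filtering instead of the aniso box filter mentioned above, to keep it short: the reference filter is a weighted sum of 4 texels, and the stochastic version does one texel fetch chosen with probability equal to that texel's weight. `Texture`, `sample_bilinear` and `sample_stochastic` are made-up names, and the texture is single channel with clamp-to-edge addressing.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

struct Texture {
    int width, height;
    std::vector<float> texels;  // single channel, row-major
    float fetch(int x, int y) const {
        assert(width > 0 && height > 0);
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);    // clamp to edge
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return texels[y * width + x];
    }
};

// Lower-left texel of the 2x2 footprint plus the fractional offsets.
struct Footprint { int x0, y0; float fx, fy; };

inline Footprint footprint(const Texture &t, float u, float v) {
    float x = u * t.width - 0.5f, y = v * t.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    return Footprint{x0, y0, x - x0, y - y0};
}

// Reference: ordinary bilinear filter, a weighted sum of 4 texels.
inline float sample_bilinear(const Texture &t, float u, float v) {
    Footprint f = footprint(t, u, v);
    return (1 - f.fx) * (1 - f.fy) * t.fetch(f.x0, f.y0)
         + f.fx * (1 - f.fy) * t.fetch(f.x0 + 1, f.y0)
         + (1 - f.fx) * f.fy * t.fetch(f.x0, f.y0 + 1)
         + f.fx * f.fy * t.fetch(f.x0 + 1, f.y0 + 1);
}

// Stochastic version: one fetch. Picking the x and y offsets independently
// with probabilities fx and fy selects each texel with its bilinear weight.
template <typename Rng>
inline float sample_stochastic(const Texture &t, float u, float v, Rng &rng) {
    Footprint f = footprint(t, u, v);
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    int dx = uni(rng) < f.fx ? 1 : 0;
    int dy = uni(rng) < f.fy ? 1 : 0;
    return t.fetch(f.x0 + dx, f.y0 + dy);
}
```

Averaged over many samples the stochastic result converges to the bilinear one, which is why it pairs with accumulating or filtering after shading.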
neural textures though are kind of a meme tbh
don't bother with those
filtering after shading is useful though
so thinking through how to set up a scene, just as first attempt with very minimal materials I am thinking of this approach:
typedef struct ps_texture_t {
    cudaTextureObject_t texture;
    cudaArray_t pixels;
} ps_texture_t;
typedef struct ps_sampler_t {
    bool linear_filtering;
} ps_sampler_t;
typedef struct ps_material_t {
    ps_texture_t base_color_texture;
    ps_sampler_t base_color_sampler;
} ps_material_t;
typedef struct ps_triangle_material_t {
    size_t material_index;
} ps_triangle_material_t;
typedef struct ps_mesh_t {
    ps_vertex_attributes_t *vertices;
    uint3 *indices;
    ps_triangle_material_t tri_material_map;
    CUdeviceptr d_vertices;
    CUdeviceptr d_indices;
    CUdeviceptr d_gas_output;
    OptixTraversableHandle gas_handle;
} ps_mesh_t;
typedef struct ps_node_t {
    ps_m4x4 transform;
    ps_node_t *child_nodes;
    OptixInstance *instance;
} ps_node_t;
typedef struct ps_scene_data_t {
    ps_material_t *materials;
    ps_mesh_t *meshes;
    ps_node_t *nodes;
    ps_node_t *root;
    OptixTraversableHandle ias_handle;
} ps_scene_data_t;
why?
because it gets really hard to follow at 2 levels of indirection already
same way void* is hard to follow
anyway
oh I see
looks like roughly good general direction
I see so that I'm not passing a random pointer to the wrong function, and I get compile time safety
Your node has a pointer to children, but doesn't have a count?
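One possible way to answer the count question, as a sketch only: keep each node's children as a contiguous run in a shared index array and store an offset plus a count. `Node`, `Scene` and `count_descendants` are made-up names and not the ps_* structs above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout: children are a contiguous run in one shared array,
// so a node stores a start offset and a count instead of a bare pointer.
struct Node {
    std::uint32_t first_child;  // offset into Scene::child_indices
    std::uint32_t child_count;  // number of entries in that run
};

struct Scene {
    std::vector<Node>          nodes;
    std::vector<std::uint32_t> child_indices;  // node indices, grouped per parent
};

// Count every node below node_index by walking the child runs.
inline std::uint32_t count_descendants(const Scene &scene, std::uint32_t node_index) {
    assert(node_index < scene.nodes.size());
    const Node &node = scene.nodes[node_index];
    std::uint32_t total = 0;
    for (std::uint32_t i = 0; i < node.child_count; ++i) {
        std::uint32_t child = scene.child_indices[node.first_child + i];
        total += 1 + count_descendants(scene, child);
    }
    return total;
}
```

This roughly mirrors how glTF itself stores node children as a list of node indices, so it maps over from the loader without pointer fixups.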
that we had basically full DX9 capability on the web 25 years ago, and the situation hasn't improved much
ping @ jasperrlz about it I'm sure he'll instantly tell you where you're wrong
the video capture is creating all those artifacts when I go in
RT looks pretty good. There's no actual lighting in this, just the shadow trace
the lack of mipmaps though is brutal
nice sponzer
thanks!
nice cubes too
Is there any raster going on or is it all rt?
Yikes
Calculating the gradient for mipmaps was rough when I was doing full rt
Have fun
heh
I did a C++ metaprogramming thingy
template <typename T>
ps_internal void allocate_list(auto &list, i32 &count, T **ptr) {
    // record the element count, then allocate uninitialized storage for it
    count = (i32)list.size();
    if (!count) {
        // empty list: no allocation, hand back a null pointer
        *ptr = nullptr;
        return;
    }
    *ptr = (T *)malloc(sizeof(T) * count);
    PS_MEM_CHECK(*ptr);
}
void ps_gltf_set_counts(ps_gltf_t *gltf, ps_asset_t *asset) {
    fastgltf::Asset &fgltf = gltf->asset.get();
    allocate_list(fgltf.scenes, asset->num_scenes, &asset->scenes);
    allocate_list(fgltf.nodes, asset->num_nodes, &asset->nodes);
    allocate_list(fgltf.meshes, asset->num_meshes, &asset->meshes);
    allocate_list(fgltf.textures, asset->num_textures, &asset->textures);
    allocate_list(fgltf.images, asset->num_images, &asset->images);
    allocate_list(fgltf.samplers, asset->num_samplers, &asset->samplers);
    allocate_list(fgltf.materials, asset->num_materials, &asset->materials);
}
I don't normally do those, but here it seemed like nice to do
You forgot to add requires( is_trivially_copyable_v<T> )
I think this code is UB if the type isn't trivially copyable
also why not return the pointer instead of taking a pointer to a pointer?
to set it to nullptr if size is empty
nothing is being copied here though
it's just setting a count and allocating memory
my gltf code had so many bugs, the overall approach worked, it was just death by a thousand cuts, and I bled through so many. it's working well now though
misplaced increments, wrong pointers used, all kinds of trivial issues that just caused illegal pointer use errors or cuda checks to fail with no helpful error messages, even with optix validation on
based on https://en.cppreference.com/w/cpp/language/classes.html#Trivially_copyable_class I think all my types are trivially copyable
I just use structs, pointers and scalar values
yeah it doesn't hurt to put that in the requires clause, then you will at least get an error if you try to pass it anything else
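A minimal sketch of the two suggestions above (return the pointer, and reject non-trivially-copyable element types at compile time). The name `allocate_list_checked` and the `demo_node_t` struct are made up for illustration, and the `assert` stands in for `PS_MEM_CHECK`, whose definition isn't shown in the thread. It uses `static_assert` so it also builds before C++20; with C++20 the same condition could go in a `requires` clause instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <type_traits>
#include <vector>

// Hypothetical variant of allocate_list: returns the allocation instead of
// writing through a T**, and refuses element types that malloc'd storage
// would not be safe for.
template <typename T, typename List>
T *allocate_list_checked(const List &list, std::int32_t &count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "allocate_list_checked: T must be trivially copyable");
    count = static_cast<std::int32_t>(list.size());
    if (count == 0) {
        return nullptr; // empty source list: nothing to allocate
    }
    T *ptr = static_cast<T *>(std::malloc(sizeof(T) * static_cast<std::size_t>(count)));
    assert(ptr != nullptr); // stand-in for PS_MEM_CHECK
    return ptr;
}

// Illustrative plain struct (pointers and scalars only).
struct demo_node_t {
    int mesh_index;
    float scale;
};
```

Returning `nullptr` for the empty case covers the "set it to nullptr if size is empty" requirement without the pointer-to-pointer parameter.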
current project scale. I quarantined fastgltf and sdl_image to cpp files, because they slow down compile time and require std and C++ semantics I don't like
fastgltf also had a bunch of problems in .cu files
I'm gonna add tinyexr, a sky and then work on ray differentials I guess
Is the EXR for the skybox?
i have that, it was simple to add, but honestly i like .hdr format better
why do you like .hdr better?
it has a cooler name
the optix sdk has a tinyexr example is why I am using it, and I don't really know what I'm doing so I need that crutch
Isn't .hdr like some borked thing with a shared exponent?
what does ASO use?
I have a tiny bit of understanding of the exr format, as in my previous project I was investigating writing my own exr library, so I read through the spec a bit and played around with the cli tool
I use OpenEXR though, not tinyexr
the problem with any of this NIH stuff and images is always having to write compression code
STB Image loads HDR, you don't need another library
I don't use stb
although for the version on steam I ship it as BC6H inside LZ4
HDR is also about 1/3rd the file size (for same resolution) as EXR
I just use EXR for dev because it's supported by GIMP & Blender
HDR is as well
my image libraries are going to be SDL_image, tinyexr (assuming it works), and NVTT and hopefully that's all I'll ever need
The reason it's 1/3 the size is because it loses data lol
yes, it's compressed
Lossy compression
but 33% the size is significant
But I never ship the EXR so the size doesn't matter
exr doesn't just have a default compression
the lossy compression stage at the end is gonna be BC6H anyways
And you definitely don't wanna do two rounds of lossy compression
idk I appreciate learning about .hdr, thank you
EXR has two different options for compression
probably i guess I can leave EXR in there since it already works but I'd prefer less moving parts
maybe I was mistaken
Oh wait actually it does have more
yeah
I only ever used zip & piz
tinyexr also documents a bunch of compression options
which I think are just a subset
if you look at the open exr tool
its cli help is like the clang compiler
massive
I'm just using whatever I downloaded from polyhaven
they have both exr and hdr
ok
sun is glaring
everything looks a little nicer now
ok I guess I have to learn what ray differentials are now
at some point I should compact these acceleration structures too
and use mesh optimizer
I feel like I'm on track still to be back to working on my game before the end of the month
that paper you linked is good
the basic idea for ray differentials is to intersect two imaginary rays (one above the current pixel, and one to the right) with the plane of the surface hit by the primary ray
will this require I start resorting to a cache
then you can get the uv of the imaginary rays and then calculate the partial derivatives through shrimple subtraction
wdym
the code for what I'm describing is like 5 loc I think
you know how ray tracing renders look noisy and then increase in detail if you remain still
there's some cache yes
I think I see that in your game?
idk
instead of
auto ray = MakeRay(coord);
auto hit = TraceRay(ray);
auto color = Sample(hit.texture, hit.uv);
you do
auto ray = MakeRay(coord);
auto rayDx = MakeRay(coord + {1, 0});
auto rayDy = MakeRay(coord + {0, 1});
auto hit = TraceRay(ray);
auto hitDx = IntersectPlane(rayDx, hit.position, hit.normal);
auto hitDy = IntersectPlane(rayDy, hit.position, hit.normal);
auto uvDx = hitDx.uv - hit.uv;
auto uvDy = hitDy.uv - hit.uv;
auto color = SampleGrad(hit.texture, hit.uv, uvDx, uvDy);
you can see the extra two fake rays I added
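The pseudocode above glosses over how a plane hit turns into a UV. A rough CPU-side sketch of that step, assuming the hit triangle's vertices and UVs are available: intersect each neighbour ray with the triangle's plane, get barycentric coordinates for that point, interpolate the UV, and subtract the primary hit's UV. All type and function names here are invented for illustration and are not OptiX or CUDA API.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

struct Ray { Vec3 origin, dir; };
struct Triangle { Vec3 p[3]; Vec2 uv[3]; };
struct UvGradients { Vec2 ddx, ddy; };

// Point where the ray crosses the plane through `point` with normal `n`.
// Assumes the ray is not parallel to the plane.
static Vec3 intersect_plane(Ray r, Vec3 point, Vec3 n) {
    float denom = dot(r.dir, n);
    assert(std::fabs(denom) > 1e-8f);
    float t = dot(sub(point, r.origin), n) / denom;
    return add(r.origin, mul(r.dir, t));
}

// UV at a point q lying in the triangle's plane, via barycentric weights.
// q may fall outside the triangle; the UV is then extrapolated.
static Vec2 uv_at(const Triangle &tri, Vec3 q) {
    Vec3 e1 = sub(tri.p[1], tri.p[0]);
    Vec3 e2 = sub(tri.p[2], tri.p[0]);
    Vec3 n = cross(e1, e2);
    float inv = 1.0f / dot(n, n);
    Vec3 w = sub(q, tri.p[0]);
    float b1 = dot(cross(w, e2), n) * inv;
    float b2 = dot(cross(e1, w), n) * inv;
    float b0 = 1.0f - b1 - b2;
    return {b0 * tri.uv[0].x + b1 * tri.uv[1].x + b2 * tri.uv[2].x,
            b0 * tri.uv[0].y + b1 * tri.uv[1].y + b2 * tri.uv[2].y};
}

// Finite-difference UV derivatives from the two neighbour-pixel rays.
static UvGradients uv_gradients(const Triangle &tri, Vec3 hit_pos, Vec2 hit_uv,
                                Ray ray_dx, Ray ray_dy) {
    Vec3 n = cross(sub(tri.p[1], tri.p[0]), sub(tri.p[2], tri.p[0]));
    Vec2 uvx = uv_at(tri, intersect_plane(ray_dx, hit_pos, n));
    Vec2 uvy = uv_at(tri, intersect_plane(ray_dy, hit_pos, n));
    return {{uvx.x - hit_uv.x, uvx.y - hit_uv.y},
            {uvy.x - hit_uv.x, uvy.y - hit_uv.y}};
}
```

The two resulting vectors are what a gradient-taking texture fetch would consume; on the device the same math would sit wherever the texture is sampled.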
thank you
oh that is temporal denoising which is often used with path tracing
my game uses DDGI but it still has some temporal stuff to make it work good. it doesn't have the characteristic "noisy then smooth" artifact though
ray differentials (which I don't even have btw) are just for nice texture sampling
and don't require anything but a little bit of math in the shader
I think the paper you linked has something almost exactly like this which I'm just recalling
oh ok
but writing it myself helps solidify it in my brain 😌
your paper has another technique based on ray cones, but I didn't read all of the paper to understand how the cones come into existence
I'm guessing it is simpler though
I'm concerned a bit with perf
does optix or cuda give you a function for sampling with gradients? or do you have to implement anisotropic filtering yourself?
ok nice, that's the only part that would destroy perf if you had to implement it manually
but doing it in hardware is practically free
auto ray = MakeRay(coord);
auto hit = TraceRay(ray);
auto color = Sample(hit.texture, hit.uv);
this isn't how my shaders work fwiw
the first part is in the raygen
the hit then does the sample
in your example you seem to be sampling the texture in the same place as where the trace is being called
which is the raygen shader
where do you have the texture sampling?
in the closest hit shader
you can pass it the info necessary to spawn the imaginary rays
which is just the pixel coordinate tbh
well
there are some optix functions I can call in the hit shaders
that can probably give me this yes
stuff like
const float3 ray_dir = optixGetWorldRayDirection();
const float3 P = optixGetWorldRayOrigin() + optixGetRayTmax() * ray_dir;
hmm I'm not sure how optix works exactly. how can you pass info to hit shaders?
a few different ways
I can over-allocate the buffer for the acceleration structure and stick data there, I can vary the vertex stride and stick data there adjacent to the vertex, I can allocate device memory and stick pointers to that memory on the shader binding table records, and then there's a global param I can stick pointers and scalar values on
I don't really have any problem getting whatever data I need
in terms of application data
the data available in the context of the ray trace execution all comes via optix functions though
I don't know of a way to pass data directly from ray gen to the hit shader, I think that would be problematic
hrm
although optixTrace might let me attach something
in vulkan rt there is a small piece of data that can be passed between stages
I forgor what it's called
ah yeah that's what it's called in vk
so I think the payload is how you're "supposed" to pass, like, the pixel coord that spawned the ray
then you can use that in the hit shader to spawn the fake rays for calculating derivatives
I'm distracted trying to implement something like the api described here
https://www.sebastianaaltonen.com/blog/no-graphics-api
idk how far I'll take it. it just seems neat
ok so it seems like you can pass data in both directions with these
which is handy
that means you can do the shading in the raygen shader if you pass the hit info back from the hit shader
but having a big payload is limiting for perf so maybe you want to shade in the hit shader anyway
static __forceinline__ __device__ void setPayload(float3 p) {
optixSetPayload_0(__float_as_uint(p.x));
optixSetPayload_1(__float_as_uint(p.y));
optixSetPayload_2(__float_as_uint(p.z));
}
is how this works in my project
noice
I just write data back, but I hadn't considered reading from it in the hit shader I guess
and it's done via those functions so I didn't associate it with the param I guess
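Since payload values are 32-bit unsigned integers, anything else has to be squeezed into that shape. A host-side sketch of the two tricks involved: the float/uint bit reinterpretation that `__float_as_uint` performs on the device, and packing a pixel coordinate into a single payload value. The helper names are made up, and the 16-bits-per-axis layout is just one possible choice that assumes launch dimensions below 65536.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable equivalent of the bit reinterpretation done by CUDA's
// __float_as_uint / __uint_as_float (no numeric conversion, same bits).
static std::uint32_t float_as_uint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static float uint_as_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Hypothetical layout: x in the low 16 bits, y in the high 16 bits, so the
// pixel that spawned a ray fits in one payload value.
static std::uint32_t pack_pixel(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);
    return (y << 16) | x;
}

static void unpack_pixel(std::uint32_t packed, std::uint32_t &x, std::uint32_t &y) {
    x = packed & 0xFFFFu;
    y = packed >> 16;
}
```

A hit program that reads the packed value back could rebuild the neighbour-pixel rays from it without growing the payload by more than one slot.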
I'm not sure whether it's better to do the shading in the hit shader or the raygen shader
I think it might be cleaner to have the raygen shader do everything
my hit shader does its own optixTraverse to see if it is occluded
also it'd map better to the code I posted 
oh I think perf favors the big raygen shader because you can do shader execution reordering at a finer granularity
I don't think SER is supported in other stages
I have only briefly read about reordering
optixReorder is available in ray gen only in optix
I like this doc you linked to
man
yes
you're building this yourself on top of vulkan?
yeah I don't think it will be hard
it's 18000 words btw
I already have?
nice
oh no
public struct MeshPart {
public void *triangles;
public uint32_t num_triangles;
public float3 *posBuffer;
public float3 *normals;
}
// TODO: kill kill kill
public struct MaterialParams {
public interpreterProgram program;
public MeshPart meshPart;
// must be aligned to 8-byte boundary
public uint8_t[256] args;
}
public struct Scene {
public int64_t maxPartsPerMesh;
public Slice<MaterialParams> materialParams;
public Slice<AccelInstance> accelInstances;
public Accel accel;
public lightAccel lightAccel;
public Sampler sampler;
// there's more fields here, consult scene.go
};
many indirections
public struct interpreterProgram {
public interpreterABI abi;
public uint32_t *code;
}
abi should not be called abi but something more specific and be behind an indirection too
host side is very simple
func NewInterpretedMaterial(blob *matc.InterpretedMaterial) *InterpretedMaterial {
device := gpu.MakeSliceUncached[uint32](len(blob.Code))
copy(device.Value(), blob.Code)
return &InterpretedMaterial{
emissive: blob.ABI.EDFCount > 0,
program: material.InterpreterProgram{
ABI: blob.ABI,
Code: gpu.SliceData(device),
},
}
}
gpu.MakeSliceUncached thingy and that's it
.Value() gets you host []T from a gpu.Slice[T]
unfortunately I have to deal with this cope until I'm able to have addresses match between host and device address spaces (or better yet have just one)
this is not something I cooked up in my go abstraction btw
I had similar thingy in my C stuff before
https://gitlab.freedesktop.org/nanokatze/gpu-hello-world/-/blob/main/main.c?ref_type=heads#L144-145 this doesn't particularly show anything because it literally only works for one (1) allocation but it's the same interface I used in my C stuff in the past
that bit of code is just something I used to poke the driver with
this is how I map device pointers to host address space in my toy program
it's horrendous yes
and it's very easy to do better
with e.g. a radix tree
I just don't care atm
is it slow?
probably
I mean my app barely spends any time on cpu atm
also I should stop calling my thing uncached
I haven't been using ~HOST_CACHED type in forever
there doesn't seem to be a penalty for it on either the deck or the puter with gigachungus igp
it's really not, it'd be nicer if I could have things in the same address space
like cudaMallocManaged
I don't think everyone should have to write their own
yes there should be a library for it I agree
or better yet we should have demand paging so we can just use our normal allocator we hammer on host all the time
i.e. malloc
but this is me asking too much tbh
once I get to reading through the cuda programming guide I will clean up my memory use
I think I will have to rewrite some code
which parts are you trying to implement exactly
I might have had some of them implemented already
well none of it looks particularly hard to implement
it's just a reduced api surface basically
plus a hashmap for the pointer thingy
oh actually the pointer thing is gonna be harder yeah
you have to use some kind of interval mapper
yeah that's ez
I'm p happy about my hw image api
ong
idk I'm going to see what I can do with just dynamic state
uh I think shader object has renderdoc support now
and descriptor buffer
i started reading it, but no attention span left
descriptor buffer is honestly a meme
unless you're dxvk or something then it's very good
yeah I don't personally think it's very useful for me
but since this is an exercise in reducing api surface it might be interesting to mess around with
how so
and like you don't need descriptors for anything else so that's the only thing that needs addressing
so you create an image
but actually it creates an image + image view
so the resulting object is immediately usable for e.g. descriptors and stuff
oh yes I was actually thinking of having a hidden chungus descriptor set instead of exposing any descriptor stuff
and you just get resource indices
yeah
ok right you're in C++
you can do better than resource index cope I think
or rather hide it away
I can have decent code sharing to make it nice to work with
ok hear me out
like a function in c++ that returns Image2D and an equivalent type (with the same memory layout) in glsl
I already have that in my game
what if you had your own object type (like I have Image in my toy abstraction)
but you allocated it out of a huge std::vector<Image>
hmm like a VkJaker?
no
loike this idk this is already basically a VkImageView (well it just has two descriptors)
anyway
you could allocate it out of a chungus vector<Image>
pass the host pointer as is to the gpu
hmm why would I do that
do some math like
(pointer - where_that_vector_starts) / sizeof(Image)
????????
descriptor index.
@shut hornet cease and disperse
I see but I don't think it helps a lot tbh
it moves complexity to the shader where it now has to think about the cpu pointer offset
it helps in that you don't have several identities for your image objects, only host pointer
and in shader you hide that complexity somewhere
same as you'd on host
or
or
true but I think I can smuggle the texture identity into a few bits of the descriptor index
you could do the opposite
and have your Image be descriptor index on host
and have host methods look things up in some out of band vector<ActualImageStuff> array
so your class Image is just uint32_t index
my point is that with this you'd have the same identity for an image at all times
no Image* vs descriptor index dichotomy
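A small sketch of the first variant, where the host pointer is the identity and the descriptor index falls out of pointer arithmetic. `Image` and `ImagePool` are placeholders rather than any real API, and the pool is given a fixed capacity up front so the vector never reallocates and pointers stay valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder for whatever per-image state the host keeps.
struct Image {
    std::uint64_t view_handle;
    std::uint32_t width, height;
};

struct ImagePool {
    std::vector<Image> images;

    // Fixed capacity: growing the vector later would invalidate every Image*.
    explicit ImagePool(std::size_t capacity) : images(capacity) {}

    // Typed pointer subtraction already divides by sizeof(Image).
    std::uint32_t descriptor_index(const Image *img) const {
        const Image *base = images.data();
        assert(img >= base && img < base + images.size());
        return static_cast<std::uint32_t>(img - base);
    }

    // The same math on raw addresses:
    // (pointer - where_that_vector_starts) / sizeof(Image)
    std::uint32_t descriptor_index_from_address(std::uintptr_t host_addr) const {
        std::uintptr_t base = reinterpret_cast<std::uintptr_t>(images.data());
        assert(host_addr >= base);
        return static_cast<std::uint32_t>((host_addr - base) / sizeof(Image));
    }
};
```

The second variant from the chat flips this around: the `Image` type is only the `uint32_t` index, and host-side methods look up the real state in an out-of-band array using that index.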
and I could do that with what I'm describing
i c i c
hmm
ok I am not describing what you are
I thought you were trying to avoid having Image2D, Image3D, Image2DMS, etc.
the big image vector thing reminds me of https://floooh.github.io/2018/06/17/handles-vs-pointers.html
28-Nov-2018: I’ve added a small update at the end on how to prevent ‘handle collisions’ with a per-slot generation counter
but I think avoiding those isn't necessarily a good thing
the title sounds gross already
yeah that's orthogonal to what I'm describing
group items of the same type into arrays, and treat the array base pointer as system-private
yes
but
that's generally not a great thing to do/have imo
we're only doing it here because we're coping with the fact that some hw insists on having descriptors basically live in their own address space
(and the addresses are indices)
and that's how the apis are anyway
your point is I can just pass &image to a shader and then, with just that and some secret extra info, sample the texture from the gpu
whereas with my original design I need to query an index from the image, then pass it to the shader
or, have your image itself be just an index, even on host
and store the actual data, which would be accessed by the methods of that type, out of band
but ye
how would it look to write code that uses that?
I'm just trying to see how it simplifies things
basically nothing would change except your images (image views) would become pass-by-value (because they're indices now) and you wouldn't be "getting an index", just passing the image into the device-accessed structures directly
so
hmm ok
so that means moving the zoo of different image types to the cpu if I'm striving for type safety
that's orthogonal
struct ActualFunStuff {
VkImageView vkView;
};
bit allocator;
std::vector<ActualFunStuff> actualFunStuff;
class Image {
uint32_t index;
/* e.g. */
Image SubImage(format Format, /* other things for reinterpretation */) const {
/* create new image view here */
}
/* other stuff you'd want to do with an image view on cpu */
};
well how do I do .memberWhichIsAnImage2D = someImage on the cpu then, and have the types match
ok yeah in that case you probably wanna do a bunch of typed wrappers I guess
Image2D etc
oh I guess I can have type safe image types in my project, I can typedef metallic/normal/base image shaders on the cpu when I extract them from fastgltf or wherever and then use those types in my materials that I read in the shader. right now I just have a bunch of cudaTextureObject_t
also a bit orthogonal
that feels like the wrong "plane" to encode into the types
hmm yeah there are definitely cases where you want an image object to represent different image shapes
like multisample vs not multisample
In february spent a long time being frustrated with my normal map because they ended up being srgb
but otherwise I don't think anyone needs to replace a 2D image with a 3D image or something
and had visible artifacts
that I misunderstood as being something wrong with my math but it was just the dumb image format
I mean the user (you) can just use union for when a single location might either have a 2D or 2DMS image in it, depending on some other knob
I'm not sure encoding srgbness into the type is the solution but maybe it is
I don't know
yeah it's probably not
this might just be one of those problems where you suffer once and then the only solution is to never forget the lesson you learned
well if you all make a cool easy to use GPU API that makes life easier working with images and memory and open source it I would try it
I'm pretty happy with what I have so far in my project
your toy project uses the worldspawn library?
yes
my toy project is called worldspawn
it's just the gpu thing is sourcedropped
the rest is not open source although many of the components could be
I'll get to it some time
it's just that it needs to be sanitized
or the gazers might end up needing eyebleach
#1181368932637093888 is worldspawn tbc
I think I'll just do compute and no graphics or presentation initially
all that stuff is bloat
based
who needs mipmaps
this is totally fine
jk, jk
I think that's a good test right there
that view
also this one
so in the optix samples I see
#include <cuda/random.h>
const float2 subpixel_jitter =
subframe_index == 0 ? make_float2( 0.5f, 0.5f ) : make_float2( rnd( seed ), rnd( seed ) );
this is in a raygen shader
unsigned int seed = tea<4>( launch_idx.y * launch_dims.x + launch_idx.x, subframe_index );
// The center of each pixel is at fraction (0.5,0.5)
const float2 subpixel_jitter =
subframe_index == 0 ? make_float2( 0.5f, 0.5f ) : make_float2( rnd( seed ), rnd( seed ) );
const float2 d =
2.0f
* make_float2( ( static_cast<float>( launch_idx.x ) + subpixel_jitter.x ) / static_cast<float>( launch_dims.x ),
( static_cast<float>( launch_idx.y ) + subpixel_jitter.y ) / static_cast<float>( launch_dims.y ) )
- 1.0f;
const float3 ray_direction = normalize( d.x * U + d.y * V + W );
const float3 ray_origin = eye;
what's the question
no question. just looking through the sampler code for how to do things and commenting out loud about it I guess
I was looking at this paragraph
so I thought maybe I could just use some jitter
and sample level 0
In my renderer I always do one primary ray exactly in the center of the pixel (using rasterization) and then one additional ray that is randomly jittered (for antialiasing rather than texturing). It converges (mentally) to the correct result over time but is definitely noisy for any one single frame (as the quote you show from the article says, although it's talking about textures). I have a memory that you've maybe said you don't want the noisy look, though?
It's not directly comparable, but I found a section of a video where I zoom in close in my renderer and show an example of what I'm talking about (hopefully the link with the timestamp works): https://www.youtube.com/watch?v=oUfrtmJeyEU&t=1632s
It's hard to tell from a still shot or from video what it actually looks like in realtime (it definitely looks much better in motion than it does in a single screenshot hehe), but maybe might give at least a general idea of what texturing might look like if you went that route with only a few rays per pixel.
If you do temporal accumulation though then of course this doesn't apply as much and it actually would converge over time.
thank you
I've watched that video before
it's great
you have a very clear way of explaining concepts
That's kind of you to say! I think maybe having a music background and only coming to programming later in life leads to me thinking about things differently compared to many programmers and maybe the way I talk about things also takes a more intuitive/conceptual approach because of that (and also because I often have trouble understanding more programmer-y explanations myself hehe). At work, at least, I sometimes am able to talk artist language hehe, so it's gratifying to hear that some of that maybe also comes through in video form.
But I didn't mean to derail your thread! I've considered maybe trying the same approach to textures when I finally add them to my current project so I'd be interested in seeing the results if you try. Intuitively, though, my suspicion is that using mipmaps is probably better both for quality and to help with texture caching, as that quote says.
you didn't derail anything, thank you. i have to learn all of these things
I think I understand what that paper is saying: I send some extra data from the raygen corresponding to a ray one pixel up and another one pixel right, then I use the triangle I intersect to generate hit points for those extra rays, and from that I can get a differential to calculate the LOD when sampling the texture
regarding your video, I really like the HDR portion, the whole video is great
but I also like how you explain how noise helps the brain fill more information than there is
tomorrow I will generate mipmaps and try to see if I can generate a render without all the aliasing
if I can get that all done I'll get back to my track and vehicle physics, and get that working, and I'll be where I left it before I reached that precipice
are you gonna do temporal AA
no I don't think I will. just try ray differentials
seems straightforward, it's just one ray, it's just a big payload per ray
The thing that really made noise click for me personally was working on dithering at my previous job. Unfortunately I can't go into details too much because that industry is much more closed and secretive than games, but the effect that good dithering has on the brain feels like magic to me.
Since falling in love with dithering I've probably gone too far with embracing all kinds of noise, but it has also made me appreciate monte carlo techniques in a way that I didn't quite understand as much back when I was in school doing path tracing. It really does help the brain in a way that feels magical to me (like, the fact that we can effectively do these continuous integrals over the entire hemisphere at every single bounce is incredible to me).
@vivid barn re: mapping offset host pointers to gpu pointers
idk what the radix thing is you suggested for storing intervals, but wouldn't a simple binary tree work? the intervals can't overlap, so you just need to locate the two nearest intervals (sorted by the start of range), then one of the intervals will contain your offset pointer
I am using a shrimple linear search for now btw, but if perf sucks I will explore the option I just explained (unless you tell me it can't work for some reason)
No I use sorted list of intervals not even a tree, basically an even dumber solution
But yes an interval tree will work
Radix tree
Like how an MMU works
You have a tree that's a few levels deep, like 3 to 5
Oh I assumed: i, ok := slices.BinarySearch(deviceAddrs, uint64(p))
wikipedia calls it a radix "trie"
Oh
oh that's just an alternative name, nvm
Yeah that's searching a sorted list
Right
I guess it's a tree
implicit 🅱️ree
But it doesn't have certain properties of a tree like log n insertion
Because you have to shift things around when inserting or deleting
This basically I guess
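The sorted-list lookup discussed above, sketched in C++ rather than Go: keep the mappings sorted by host base address, binary-search for the last mapping that starts at or before the query address, then check that the address actually falls inside it. The `Mapping` struct and function name are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One host allocation and the device address it corresponds to.
struct Mapping {
    std::uint64_t host_base;
    std::uint64_t size;
    std::uint64_t device_base;
};

// `sorted` must be ordered by host_base and contain no overlapping ranges.
// Returns false if host_addr is not inside any mapping.
static bool host_to_device(const std::vector<Mapping> &sorted,
                           std::uint64_t host_addr,
                           std::uint64_t &device_addr) {
    assert(std::is_sorted(sorted.begin(), sorted.end(),
                          [](const Mapping &a, const Mapping &b) {
                              return a.host_base < b.host_base;
                          }));
    // First mapping whose base is strictly greater than the address...
    auto it = std::upper_bound(sorted.begin(), sorted.end(), host_addr,
                               [](std::uint64_t addr, const Mapping &m) {
                                   return addr < m.host_base;
                               });
    if (it == sorted.begin()) {
        return false; // address is below every mapping
    }
    --it; // ...so the candidate is the one just before it.
    std::uint64_t offset = host_addr - it->host_base;
    if (offset >= it->size) {
        return false; // falls in the gap after the candidate
    }
    device_addr = it->device_base + offset;
    return true;
}
```

Lookups are O(log n); as noted in the chat, inserts and deletes into the sorted array still shift elements, which is where a real tree or radix structure would differ.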
yeah a trie (pronounced "try") is a special subset of tree for building up prefixes, it's the kind of thing I largely haven't heard outside of college classes
it's good for auto completion
haven't gotten quite to generating mipmaps yet, but I am close, I just finally got bc7 compressed texture objects to render. there's not a way to inspect the texture objects in nsight compute though, so I can't like show that, but they are compressed now
maybe it's viewable via total memory allocation or something
I think the reason I didn't have trie come up in my mind is because the dominant term for this thing in my environment was and is "prefix tree"
"radix tree" is another common terminology
I was able to get rid of SDL_image as nvtt handles image loading. I can get rid of tinyexr too I think but I will worry about that later
I just spent most of my day figuring out how to use nvtt with block compression and cuda texture objects :\
I finally had to find an example on github that someone had working
ok I have compressed images with mip maps
now I just have to do the ray differential math in the shader
I didn't expect that to be so hard, but it's only hard because the docs were bad
well clearly it's not common enough for e.g. jaker
not sure if it was here where we derailed the thread into which thinkpads are/were good or not
but, i acquired a new work laptop lunch grill
finkpad p16 gen2, 65THOUSANDFIVEHUNDREDANDFIRDYFIVE megabytes of ram
You like it?
the keyboard is really nice
i havent tested anything on it yet : >
after christmas probably
I would like to buy new hardware but I don't want to spend the money, maybe 2027 :/
When I built my new PC, they sent me 16GB of DDR5 for free with the motherboard. Back in June. How times have changed
this is very cool
still, i suppose you need to be a little smart to read the paper
I had to look a lot of it up
I'm not very strong with the math, but I mostly understand what's happening
this is just for single rays, I don't have follow up rays or anything, I'd have to expand it in that case
it did be
What if you made your skimmer game inside of Sponza? And the skimmers could be like the size of a bug, and the tracks could go around inside Sponza?
Sponza Racing (TM)
I would have to speed up loading sponza, it takes a good while right now
Sponza Loading Racing (TM)
I'm not really advocating it, and you should obviously do what you want. I just suddenly got this image in my head of going through the environment at high speeds but on tracks like yours that didn't have to follow the ground so you could zoom up and down and through windows and off the balcony and stuff and it seemed potentially cool in my head 🙃
Putting the texture loading into a thread pool should be way faster. If you'd want it to be really fast, you probably need to preprocess the scene somehow.
By the way, you've made so much progress on your project. I am really impressed. I've barely worked on mine for the last two weeks
me too, im using fastgltf for loading but i think the bottleneck is waiting on the gpu to create buffers and such
part of it too is im executing command lists for every individual mesh and waiting for them to complete before going on and processing the next one
yeah, it's all completely serial right now and it is compressing the BC5 and generating mipmaps for all 50 sponza textures before it shows anything. I can obviously speed it up, but it's not what I want to work on right now
thanks, I have been on vacation and that has helped, and also have been sort of racing to meet my goal of being back to where I was before rewriting the renderer
3 weeks now I think?
#1323084490997895198 message 11/26 - 12/19, 23 days
that's pretty good for going from no code to sponza with a new vulkan + optix renderer
hrm
wondering if I should bring in my previous physics code or start from new
I'm gonna work on gameplay now for a while
planning to do fully custom physics again?
I was thinking I would follow along the book I bought
then u won’t be doing gameplay for a while 
that's sort of gameplay?
I guess in my brain any work that is not rendering is gameplay
asset pipeline/rendering/animation/level editing
heh
I don't know
I think it will be fun
there's a lot of work left on the rendering but it's good enough for now I think
@vivid barn what do you think about deleting semaphores and just having submit operations return a thing that can be waited on (I know your api has something like this)
I'm mainly concerned that there might be something else I didn't think of for which I'd need a semaphore
I hope bjorn doesn't mind us hijacking his thread to talk about this 
no my api isn't like that, I have my own sync object
ah I thought your stream submissions returned an awaitable token thingy
this idea doesn't allow you to wait on a syncobj before you submit the signaller
which idk how valuable it is to you but it prevents doing a thing where you record n cmdbufs at the same time
which might be convenient in some cases like idk
say you're doing skinning so you record two cmdbufs, one does compute dispatches that do skinning, then the second one does AS rebuilds
couldn't I just add a wait parameter for the things to wait on before submitting
"things" being sync "primitives"
ok w8 to clear it up are you suggesting that you wouldn't have random VkSemaphores but rather there would be one hard-associated with each VkQueue
so if you wanna wait on something you specify a (VkQueue, counter) pair?
hmm
you can replace VkQueues with higher level thingies of your own
this would be fine I guess
I was just thinking that my Submit would return something I could wait on at a later date (whether that be another submit or just a free wait function)
instead of making the user construct semaphores themselves
yeah that sounds fine I guess
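A toy model of the "submit returns something you can wait on" idea, with no Vulkan calls at all: each queue owns a monotonically increasing counter, a submit hands back a (queue, value) token, and later submits or host waits refer to those tokens. Every name here is hypothetical. It also shows the limitation raised above: a token only exists after its submit, so nothing can wait on work that has not been submitted yet.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum QueueId { QUEUE_GRAPHICS = 0, QUEUE_COMPUTE = 1, QUEUE_TRANSFER = 2, QUEUE_COUNT = 3 };

// What a submit returns: which queue's timeline, and the value it will reach.
struct SubmitToken {
    QueueId queue;
    std::uint64_t value;
};

struct Timeline {
    std::uint64_t submitted = 0; // last value handed out
    std::uint64_t completed = 0; // last value the (simulated) GPU reached
};

struct Device {
    Timeline timelines[QUEUE_COUNT];

    // Records a submission that must wait for `waits` before it runs.
    SubmitToken submit(QueueId queue, const std::vector<SubmitToken> &waits) {
        for (const SubmitToken &w : waits) {
            // Tokens can only come from earlier submits.
            assert(w.value <= timelines[w.queue].submitted);
        }
        return SubmitToken{queue, ++timelines[queue].submitted};
    }

    // Stand-in for the GPU signalling that work up to `value` has finished.
    void retire(QueueId queue, std::uint64_t value) {
        if (value > timelines[queue].completed) {
            timelines[queue].completed = value;
        }
    }

    bool is_done(SubmitToken token) const {
        return timelines[token.queue].completed >= token.value;
    }
};
```

In a real backend the per-queue counter would presumably be backed by a timeline semaphore, which is roughly the (VkQueue, counter) pairing suggested in the chat.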
I don't like the level of abstraction that VkQueues and cmdbufs offer at all so it's kinda hard for me to judge this
yeah I've been running into that as well
what do you find you dislike?
atm I just have three hardcoded queues and the user can pick one to submit to
well it feels like a mix of low and high level concepts
but idk
gob had an idea (for vk-tier abstraction level) where instead of VkQueues, you'd be submitting to a queue family and the magic would happen underneath to find an available hw resource to run this submission as soon as its sync conditions are met
this would mean that VkQueues disappear
VK_NV_command_buffer_inheritance wouldn't work
and instead of VkQueues, you have just VkSemaphores
VkSemaphore sync becomes mandatory because you can't just rely on ordering you get by submitting to the same queue
what do you think
eliminating the weird thing where you allocate queues from a family would be nice
yeah sounds nice
if you're ok with cmdbufs I'd implement this then
I guess I kind of do have that already
I wasn't ok with cmdbufs either (I wanted sync to be "transposed" so to speak, like in cuda or opencl) so I ended up abstracting over those too
o i c
cool
submit is missing wait semaphores but yeah
typedef enum gfx_queue
{
GFX_QUEUE_GRAPHICS,
GFX_QUEUE_COMPUTE,
GFX_QUEUE_TRANSFER,
} gfx_queue;
void gfx_submit(gfx_queue queue, gfx_command_buffer command_buffer, gfx_semaphore semaphore, uint64_t signal);
right i c
note that with gob's idea you can't have a magical thing to wait on anymore I guess
btw I don't know how individual VkQueues allocated(?) from the families map to the actual hw
unless you actually return smth like (VkSemaphore, timeline value) rather than just timeline value
hmm that makes sense ye
it's like cpu threads
I think
oh there's some new(?) queue priority thing I saw when I was making queues yesterday
but a number of kmds/fw have problems allocating more queues at runtime basically
I think the thing windows got called "hardware assisted scheduling" means scheduling decisions are deferred to fw sitting in the gpu
scheduling in fw means that each VkQueue corresponds to an actual small ring thing of submissions that's mapped to host and to fw
there's obviously lots of VkQueues at any point in time
but there's only so many hw resources that can handle each one
so anyway same as cpu threads, when you have free hw resource, you have something somewhere choose a VkQueue for it to get busy with
I don't mind at all, I am very interested in this
this is the only important bit
it sounds like the "ideal" abstraction would make as many VkQueues as possible and treat them as a thread pool, then the user would want to submit relatively granular tasks so all queues can be busy
except I'm skeptical of how that would turn out in practice given that you can saturate the gpu with just one submit
only until you run into a barrier
but yes stuff being like thread pool is my thinking too
true, but then it seems like the ideal abstraction is just a frame graph
my abstraction lets me use all the queues without a frame graph.
