#voxel-engine, a creatively named Voxel engine
Feels very strange posting about this tiny learning project in front of the massive projects here, but I decided to get back into graphics programming by making a voxel engine using the tools I am okay-ish with: C++ and D3D12.
Some principles I want to follow during this project:
(i) Almost-daily updates (accountability for myself)
(ii) No over-engineering, no unnecessary abstractions
(iii) Fast, optimized and clean code
(iv) Making something other than the generic Sponza renderer
(v) Learn what 99.9% of my code actually does (i.e. what happens when I link a library, what happens when I do X and Y, etc.)
Current status as of day 1 : Going over the basics on voxels from https://sites.google.com/site/letsmakeavoxelengine/home/, in my almost single file engine (which can be found here : https://github.com/rtarun9/voxel-engine/tree/main)
[Day 2]
Today I focused mostly on a simple renderer abstraction. Nothing crazy, the only thing it really abstracts right now is buffer creation.
Also made some small changes to the main rendering code, and now have multiple chunks being rendered.
The app load time is really slow (chunks being created, I suppose), so that is something I have to focus on soon.
[Day 3]
So today I tried making a simple chunk loading system, where if the camera enters a new 'chunk area / zone', a new chunk is loaded. I'm re-using the same chunk data over and over, just want to get the math done right.
Unfortunately I don't have any fancy screenshots as I couldn't get it working 100% right, but I'm pretty sure by tomorrow I will have this basic chunk loading system done
[Day 4]
Finally found a reason to use line list, now I have a debug visualization of the various chunks in the scene. Hopefully this will make it easier to visualize the chunks when working on the loading system
I implemented this very basic naive chunk loading algorithm, not very efficient but it gives me a baseline to implement more things
I've switched stuff around in my project so that a Chunk is just a u64 chunk_index. The actual cubes (that each voxel represents) are moved over to a hashmap of {chunk_index, Cube*} so that when I move chunks between the different vectors (loaded, unloaded, and chunks_to_load), I don't have to worry much about move / copy constructors, destructors being called, and such.
Next step is going to be learning how to procedurally generate chunks (rather than just give them all a random color).
Once that is done, I will come back to the existing code and try optimizing it a bit.
That gray background is my debug grid draw, should probably disable it when posting future videos
[Day 5 and 6]
There are some small miscalculations in how I figure out which chunk the user is currently in (I was not facing this before because voxel size was fixed to 1). My end goal is to have tiny voxels, but figuring out this math issue should give me a better understanding of the voxel / chunk coordinate system
[Day 7]
First, I've fixed yesterday's calculations. I was dividing the camera position by the wrong constant, so the "current chunk index" computation was wrong.
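A minimal sketch of that computation, assuming the divisor should be the world-space size of one chunk (voxel size times voxels per axis) and that the index is floored so negative positions land in the correct chunk. The names `ChunkCoord`, `chunk_world_size` and `chunk_from_position` are hypothetical, not taken from the repo:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical integer chunk coordinate (not the engine's actual type).
struct ChunkCoord
{
    std::int32_t x, y, z;
};

// World-space extent of one chunk along an axis.
inline float chunk_world_size(float voxel_size, int voxels_per_axis)
{
    return voxel_size * static_cast<float>(voxels_per_axis);
}

// Maps a world-space position to the chunk that contains it.
// std::floor (not truncation) keeps negative positions in the right chunk.
inline ChunkCoord chunk_from_position(float px, float py, float pz, float voxel_size, int voxels_per_axis)
{
    assert(voxel_size > 0.0f && voxels_per_axis > 0);
    const float size = chunk_world_size(voxel_size, voxels_per_axis);
    return ChunkCoord{
        static_cast<std::int32_t>(std::floor(px / size)),
        static_cast<std::int32_t>(std::floor(py / size)),
        static_cast<std::int32_t>(std::floor(pz / size)),
    };
}
```

Dividing by the voxel count alone only works while the voxel size is exactly 1, which would match the earlier note that the problem did not show up with a fixed voxel size.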
I've changed the code a bit so that even if X chunks are waiting to be created, only Y are created each frame (need to figure out how chunks next to the player can be loaded first).
However, now I'm getting a lot of issues with ComPtr (InternalRelease). I think this is because of some copy constructor related activity. I will refresh on the rule of 5 today and try using moves as much as possible (and try fixing this InternalRelease issue, which randomly occurs during the assignment operator).
[Day 8]
A short demo of the new and improved chunk loading system (still very naive). The demo renders only 1 voxel per chunk (render distance / how far chunks are loaded from player = 9)
The chunks near the player are loaded first, and then the ones far away (until chunk_render_distance). I also decided to use a stack instead of a queue to determine which chunks to create first, because if a player goes very quickly to a new region, those new chunks must be created and rendered first
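A small sketch of the "stack plus per-frame budget" idea, under the assumption that pending chunks are identified by a u64 index. `PendingChunks`, `request` and `take_for_frame` are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pending-chunk stack: the most recently requested chunk is created first,
// so chunks around the player's newest position win over stale requests.
struct PendingChunks
{
    std::vector<std::uint64_t> stack; // chunk indices waiting to be created

    void request(std::uint64_t chunk_index)
    {
        stack.push_back(chunk_index);
    }

    // Pops at most `budget` chunk indices (newest first) to create during this frame.
    std::vector<std::uint64_t> take_for_frame(std::size_t budget)
    {
        std::vector<std::uint64_t> out;
        while (!stack.empty() && out.size() < budget)
        {
            out.push_back(stack.back());
            stack.pop_back();
        }
        assert(out.size() <= budget);
        return out;
    }
};
```

Calling `take_for_frame(Y)` once per frame caps the creation cost per frame, at the price of older requests waiting longer when the player keeps moving.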
A small optimization: if any face of a voxel cube is covered by another voxel cube, create the vertex buffer in such a way that the triangles of that face are not rendered.
This simple optimization makes moving through the world "not very laggy".
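A sketch of that per-face check, assuming a dense bool grid per chunk and treating anything outside the chunk as empty (so boundary faces are always emitted). The chunk size and all names here are made up for the example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr int kChunkEdge = 4; // hypothetical chunk edge length in voxels

struct Chunk
{
    std::array<bool, kChunkEdge * kChunkEdge * kChunkEdge> active{};

    static std::size_t index(int x, int y, int z)
    {
        assert(x >= 0 && y >= 0 && z >= 0 && x < kChunkEdge && y < kChunkEdge && z < kChunkEdge);
        return static_cast<std::size_t>(x + kChunkEdge * (y + kChunkEdge * z));
    }

    // Voxels outside the chunk count as empty in this sketch.
    bool is_active(int x, int y, int z) const
    {
        if (x < 0 || y < 0 || z < 0 || x >= kChunkEdge || y >= kChunkEdge || z >= kChunkEdge)
        {
            return false;
        }
        return active[index(x, y, z)];
    }
};

// Counts the faces that would actually get triangles: a face is skipped
// when the neighbouring voxel in that direction is active.
inline int count_visible_faces(const Chunk &chunk)
{
    static constexpr int offsets[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    int faces = 0;
    for (int z = 0; z < kChunkEdge; ++z)
        for (int y = 0; y < kChunkEdge; ++y)
            for (int x = 0; x < kChunkEdge; ++x)
            {
                if (!chunk.is_active(x, y, z))
                {
                    continue;
                }
                for (const auto &o : offsets)
                {
                    if (!chunk.is_active(x + o[0], y + o[1], z + o[2]))
                    {
                        ++faces;
                    }
                }
            }
    return faces;
}
```

For a completely filled 4x4x4 chunk this emits 96 faces instead of 384, which is where the saving comes from.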
In this photo, only 1 in 5 voxel 'cubes' is set active, in order to not have the image be a large wireframe mess.
A video showing current progress. Next step -> Move the chunks that are loaded but covered by all sides by other chunks into unloaded chunks array.
(Unloaded chunks are simply not rendered, memory is still allocated which sometimes causes program to crash. This will be solved when some serialization logic is implemented)
what a crisp video thanks to compression 
Starting working on procedural world gen, but discovered a few bugs along the way (looks like Z fighting, chunk loading is a bit strange at random times)
Long time since I've posted here, but currently working on CPU frustum culling. Have most of the code done, just have to fix some calculation issues after which I'll post a screenshot here
Followed this blog : https://fgiesen.wordpress.com/2012/08/31/frustum-planes-from-the-projection-matrix/ to implement cpu frustum culling by extracting data from the projection matrix. Not 100% working as I'm facing lots of skill issues, will have to go over it again
However, even with whatever partial culling is happening, I can finally traverse large scene without much issue. Loading chunks is causing lot of lag, have to figure that out
This project is uncovering a lot of skill issues I had, will have to go through the entire code base and try finding out why it's so slow
About 25% done with the project re-write. Last time I made things too unnecessarily complex, instead of trying various approaches (to say voxel storage and rendering).
Trying to fix the issues with the project this time. Currently got imgui, bindless rendering in the project, will next try to work on a greedy meshing algorithm
imo you should start with "naive" meshing
it's simple and doesn't have the annoying edge cases (T-junctions, when to split quads, UVs) like greedy meshing
if you need to reduce the number of vertices after that, then you can look at greedy meshing
I have done 'basic meshing' I think its called, basically if a voxel is covered by other voxels that voxel has no vertices to draw
ah
Its in this post here
I can't tell if it's applied per face
It is
I suppose that naive meshing was good enough, but the project was still really laggy and very messy
As you suggested I will re-implement naive meshing and then make the code more straightforward and simple, get frustum culling, multithreaded voxel loading and see how well it performs
If it is still slow then, I will go for greedy meshing and stuff
I'd first suggest profiling to see why it was laggy
Multithreading the creation of chunk meshes means that your meshing algorithm can be pathetically slow and still not cause noticeable hitches for the user
It's kinda hard to do without shooting yourself in the feet though
Tbh if I stayed at the same place and rotated the camera, the FPS was fine (so rendering wasn't an issue)
The problem was loading / unloading chunks each frame, and creation of chunks was also slow. I was using very small chunks (16x16x16 where each voxel has a length of 1, which is maybe why it was so performance heavy?)
I think it'll stutter a lot with no multithreading, no matter the algorithm
Unless you do like 1 chunk per frame
That (1 chunk per frame to load) didn't stutter but obviously loading chunks around the player took 2-3 business days, so have to get some simple multithreading done
I've used async / future before and found it very nice to use, will probably try that out
hm
might be a nice excuse to try out c++20 coroutines
you can have a thread pool of coroutine schedulers (you'd need two libraries for this)
or you can shrimply use a thread pool and some concurrent queues
Forgot that feature existed
Will try this out as it sounds like a good learning experience
So far I've re-implemented naive meshing. Was in the process of making a multithreaded chunk loading system (using async/future first, and then once that works move on to a coroutine based thread pool).
Also, I just came across this : D3D12_HEAP_TYPE_GPU_UPLOAD and honestly think it can make my life much easier, so going to try that out
Came across that new heap type in this article : https://gpuopen.com/learn/using-d3d12-heap-type-gpu-upload/
My GPU doesn't support it, so back to using upload heaps
Async copy with a multi-threaded chunk loading system (using std::future/async) to load chunks with no lag! (using 2 different command queues, one for copy and one for direct)
Note that each cube in the video is a chunk with very small voxels. I'm using a sort of hacky mutex / condition_variable approach so have to review the code, but after I'm sure the program is correct, will work on a cpp20 - co-routine based thread pool
When I move the camera there are some slight artifacts (feels exaggerated with game capture), but I will try to fix that up soon
for (size_t k = 0; k < chunk_manager.m_loaded_chunks.size(); k++)
{
    const size_t i = chunk_manager.m_loaded_chunks[k].m_chunk_index;
    const VoxelRenderResources render_resources = {
        .position_buffer_index = static_cast<u32>(chunk_manager.m_chunk_position_buffers[i].srv_index),
        .color_buffer_index = static_cast<u32>(chunk_manager.m_chunk_color_buffers[i].srv_index),
        .chunk_index = static_cast<u32>(i),
        .scene_constant_buffer_index = static_cast<u32>(scene_buffer.cbv_index),
    };
    command_list->SetGraphicsRoot32BitConstants(0u, 64u, &render_resources, 0u);
    command_list->DrawInstanced((u32)chunk_manager.m_chunk_number_of_vertices[i], 1u, 0u, 0u);
}
This has to be the most hidden bug I have ever faced, took me a while to find what it was, but once you see it you can't unsee it
In the image above, I load chunks from indices 1 -> N. The above demo hit a use case for which the above loop works fine
I'm guessing you put k where an i should've been or vice versa
those variable names are footgunny
No, loaded chunks is a hashmap and not a vector (even I forgot about this tbh)
Say the current (and only) chunk loaded has chunk index 100
First k = 0, and because of m_loaded_chunks[0], a new chunk at index 0 is created and size of loaded chunks becomes 1
Then when k = 1, size of loaded chunks array becomes 2, a new chunk is created for loaded_chunks[1] and so on
So basically chunks were being created rapidly and the chunk index was always 0, and I couldn't even see the new chunks being loaded since they were all at the SAME position
So when I wanted to render a single chunk, I would render a million chunks, but they all had the same position, which really makes the program lag
I switched to a range based for loop for the hash map (loaded_chunks), and now the program does not run at 15 fps
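A stripped-down reproduction of that behaviour, assuming the container is a `std::unordered_map` keyed by chunk index (the `Chunk` struct and function names are stand-ins, not the engine's types). `operator[]` default-inserts a value for every missing key, so the map grows while the index loop runs:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Chunk
{
    std::uint64_t chunk_index = 0; // default-constructed chunks all report index 0
};

using LoadedChunks = std::unordered_map<std::uint64_t, Chunk>;

// Buggy pattern: indexing a hash map like a vector.
// loaded[k] inserts a default Chunk whenever key k is missing, so size() keeps growing.
inline std::size_t iterate_by_subscript(LoadedChunks &loaded)
{
    std::size_t visited = 0;
    for (std::size_t k = 0; k < loaded.size(); ++k)
    {
        (void)loaded[k]; // may insert!
        ++visited;
    }
    return visited;
}

// Fixed pattern: a range-based for only visits entries that really exist.
inline std::size_t iterate_by_range(const LoadedChunks &loaded)
{
    std::size_t visited = 0;
    for (const auto &entry : loaded)
    {
        (void)entry;
        ++visited;
    }
    return visited;
}
```

With a single loaded chunk at key 100, the subscript loop visits 101 entries and leaves 100 junk chunks behind in the map; with larger keys the damage scales accordingly.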
Not entirely sure how my program fails at rendering cubes, will have to figure this out soon
I really hope its not some error in the copy queue implementation
The bottom right blue cube in renderdoc's VS output is clearly wrong, so some calculation is being messed up
I implemented reverse z, and all the above artifacts are gone
I think what has happened is that with so much precision given to objects very close to the near plane, when moving a bit away from the camera (even a little), even on the same flat plane, the z values after the projection transform are varying a lot
I did a check in RenderDoc and literally for the above image the depth buffer had values in the range [0.9999, 1]. With reverse z, I made minor changes to the code:
(i) Reversed near and far plane when providing input to projection matrix function
(ii) clear value of depth buffer to 0.0 instead of 1.0
(iii) comparison func from Less than -> Now it is greater than
In RenderDoc, for a case like the one above, depth values now range from 0.4 down to ~0.1. Now all the precision is not taken by objects very close to the near plane, and visual artifacts are vastly reduced
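A small numeric sketch of why the swap helps, assuming a D3D-style perspective projection with depth in [0, 1] (the engine's actual projection helper is not shown, so these functions are illustrative only). Swapping near and far maps distant geometry to values near 0, where float32 values are much more finely spaced than near 1:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Depth after the perspective divide for a point at view-space distance d,
// for a projection built with plane distances (zn, zf). D3D-style [0, 1] range.
inline double projected_depth(double d, double zn, double zf)
{
    assert(d > 0.0 && zn > 0.0 && zf > 0.0 && zn != zf);
    return zf * (d - zn) / (d * (zf - zn));
}

// Standard z: near plane maps to 0, far plane maps to 1.
inline double depth_standard(double d, double near_plane, double far_plane)
{
    return projected_depth(d, near_plane, far_plane);
}

// Reverse z: same function with the planes swapped, so near maps to 1 and far to 0.
inline double depth_reversed(double d, double near_plane, double far_plane)
{
    return projected_depth(d, far_plane, near_plane);
}

// Gap between v and the next representable float above it.
inline float float_spacing(float v)
{
    return std::nextafter(v, std::numeric_limits<float>::infinity()) - v;
}
```

With near = 0.1 and far = 1000, a point 10 units away lands at about 0.990 in standard z and about 0.0099 in reverse z; the float spacing around the second value is roughly 64 times finer.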
Did a small profiling run as a sanity check, and the app slowdown after loading a lot of chunks is indeed due to very high VPC SOL. Next step is to get indirect rendering + GPU culling
I got (CPU clip space culling) and indirect rendering setup, but a major problem still exists (14+ms of .... nothing?)

Probably. I will check in the VS profiler to see what is messing with the FPS
Loaded up nsight systems and profiled the renderer, going to learn this tool, seems useful
I saw in this talk (https://www.gdcvault.com/play/1026202/Optimizing-DX12-DXR-GPU-Workloads) that over-subscribing to GPU memory can be a issue, and by coincidence my renderer starts to lag when the number of chunks is very large (> 50K)
Very confusing, since now nsight systems reports that 6ms (out of 22) is because of CopyResource (for indirect rendering, command buffer for now is copied each frame from upload heap to default heap), and in the remaining 18ms... VPC has the highest throughput, but I already do culling?
Any ways by tomorrow / tuesday I will switch to doing culling on GPU and avoid use of the CopyResource, and then see what has to be done for the VPC throughput
Implemented GPU compute culling, now the FPS has gone from 60 to 16, safe to say something is going wrong
Seems like I forgot to clear the counter. Will read more on UAV atomics and hopefully get this done by tomorrow
For some reason my camera is all buggy (only for rotation). Going to spend today re-learning the math and figure out what's wrong
The only issue is : I've used the same camera class for 2 other renderers and things are working fine in those projects... Not exactly sure whats causing this issue
Issue fixed, it had nothing to do with the camera code but actually with precision issues
If each voxel has a small length (between 1 and 16), camera rotation is fine. In the video above I had it set to 640.0f, which I suppose caused the issue
As an optimization, I decided to remove position buffers (dedicated for each chunk) and rather have a single position buffer, but each chunk now has a dedicated index buffer
So basically rather than a float3 (12 bytes), the index buffer has the same number of elements but each element is a u16 (2 bytes)
However, this required modifying the index buffer WITH indirect rendering, due to which I cannot enable GPU based validation. Also, the program crashes if I use indirect rendering (with this index buffer), whereas normal rendering works fine
I fixed the issue. Basically on the CPU, because of alignment rules, the indirect command had an implicit padding which was not reflected in the SRV and UAV byte stride. By adding a uint for padding (on both the CPU and GPU indirect command struct), the program no longer crashes
This is probably going to take a while to implement, but I am going to work on a circular ring buffer for chunks (both GPU and CPU)
Right now you fly around the world for a few minutes and allocate like 8GB memory, so instead I'll pre-allocate CPU and GPU mem for 10K chunks and re-use that memory for new chunks
I'm saying 10K chunks because my chunks are basically the size of a single minecraft voxel. Due to some precision issues I'm not able to make the voxels the same size as minecraft chunks. Maybe I should work on this first
fp precision issues are now vastly reduced... all I did was few minor things:
(i) camera position in view matrix is hardcoded to zero. In shader, when computing vertex position I subtract the camera position from the vertex position before view/projection transform
(ii) N buffers for view projection matrix (never had to use this, but made significant improvements)
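A CPU-side sketch of point (i), assuming plain float3-style positions (the real subtraction happens in the shader, which is not shown here; `Vec3` and `camera_relative` are hypothetical names). Subtracting first means the view transform only ever sees small offsets instead of large absolute coordinates:

```cpp
#include <cassert>

struct Vec3
{
    float x, y, z;
};

// Camera-relative position: subtract the camera position before any view/projection
// transform, so the view matrix can keep its translation at zero (rotation only).
inline Vec3 camera_relative(const Vec3 &world_position, const Vec3 &camera_position)
{
    return Vec3{
        world_position.x - camera_position.x,
        world_position.y - camera_position.y,
        world_position.z - camera_position.z,
    };
}
```

Two nearby positions around 1,000,000 units from the origin still differ by an exactly representable small value, so the offset handed to the rotation and projection keeps far more usable precision than the absolute coordinates would.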
[67K chunks (of 8x8x8 voxels) being rendered in the above video, good to know that the renderer can handle large chunk counts]
This doesn't look too good, but not sure if this is normal
RtlUserThreadStart uses ~94% of CPU. Maybe getting a thread pool integrated, where I can reuse threads without creating a new one / restarting an old one, will solve this issue
ntdll (which has thread related stuff inside) takes more CPU% than my actual engine, and in the engine code all the threading related functions also take a lot of CPU
I quickly integrated https://github.com/bshoshany/thread-pool, and the unattributed context section is massively reduced
Never realized how expensive the overhead of creating a thread is
Making some more changes for this thread pool logic, as rn all threads share a single command list via mutexes
The end goal is for each chunk loading thread to have its own command list, and each thread has a queue of command allocators
How has the control flow entered that if loop 
I've decided to put this project on hold, and learn systems programming & basics of gp-gpu first before continuing on this project
No particular reason, but I think I still have a lot of missing fundamental knowledge, so I will try learning that before continuing on this project. At least I got to learn about using multiple queues (and got a small blog post on that topic) from this project
Totally forgot I had put some stuff over here (thanks to Deccer for the reminder)
I restarted the project, and implemented a custom thread pool in C++
Was a really simple thing to do, and learnt about packaged task, async / future
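For reference, a generic sketch of a small thread pool built on `std::packaged_task` and `std::future`. This is not the pool from the project (that code isn't shown here); the class and member names are invented for the example:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool
{
  public:
    explicit ThreadPool(std::size_t thread_count)
    {
        assert(thread_count > 0);
        for (std::size_t i = 0; i < thread_count; ++i)
        {
            m_workers.emplace_back([this] { worker_loop(); });
        }
    }

    // Finishes all queued jobs, then joins the workers.
    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopping = true;
        }
        m_wake.notify_all();
        for (auto &worker : m_workers)
        {
            worker.join();
        }
    }

    ThreadPool(const ThreadPool &) = delete;
    ThreadPool &operator=(const ThreadPool &) = delete;

    // Wraps the job in a packaged_task so the caller gets a future for the result.
    template <typename F> auto submit(F &&job) -> std::future<decltype(job())>
    {
        using Result = decltype(job());
        auto task = std::make_shared<std::packaged_task<Result()>>(std::forward<F>(job));
        std::future<Result> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push([task] { (*task)(); });
        }
        m_wake.notify_one();
        return result;
    }

  private:
    void worker_loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stopping || !m_jobs.empty(); });
                if (m_jobs.empty())
                {
                    return; // stopping and nothing left to run
                }
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job(); // run outside the lock so other workers can dequeue
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::queue<std::function<void()>> m_jobs;
    bool m_stopping = false;
    std::vector<std::thread> m_workers;
};
```

The threads are created once and reused for every job, which avoids the per-task thread creation cost that showed up in the profiler earlier.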
Now I'm trying to rewrite the code to make sense, and start fixing memory related issues (in previous builds, when new chunks are being created, the entire app slows down because of the massive number of ID3D12Resource objects being created)
So current plan is simple : create a 500mb heap, and new chunks use the same resource as the evicted chunk, but just with the chunk related values being updated. Will try to post more updates over here as I get stuff done
New bug alert : My camera doesn't move when the game is built in release mode
and nsight and renderdoc crash (even before frame is captured)
what kind of bugs are these 
I found the keyboard issue, this is a bit embarrassing
I had this assert #define
and GetKeyboardState was called inside an Assert, so in release mode the function was never being called at all
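A minimal reproduction of that pitfall. The macro below is a hypothetical stand-in for how an assert-style macro typically expands in a release build (the project's actual #define isn't shown), and `poll_keyboard_state` stands in for a call like GetKeyboardState whose side effect is needed:

```cpp
#include <cassert>

// Hypothetical release-build expansion of a custom assert macro:
// the expression is discarded, so it is never evaluated.
#define VX_ASSERT_RELEASE(expr) ((void)0)

static int g_poll_count = 0;

// Stand-in for a function whose side effect matters (e.g. reading keyboard state).
inline bool poll_keyboard_state()
{
    ++g_poll_count;
    return true;
}

// Buggy: the call only exists inside the macro argument, so it vanishes in release.
inline void update_buggy()
{
    VX_ASSERT_RELEASE(poll_keyboard_state());
}

// Fixed: make the call unconditionally, then assert on the stored result.
inline void update_fixed()
{
    const bool ok = poll_keyboard_state();
    VX_ASSERT_RELEASE(ok);
    (void)ok;
}
```

The same rule applies to the standard `assert` with NDEBUG defined: never put an expression with required side effects inside the assertion itself.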
I used a custom assert function mostly because of this:
The definition of the macro assert depends on another macro, NDEBUG, which is not defined by the standard library.
But fair point, I'll see what is the more modern C++ alternative for this (inline function + constexpr maybe?)
invoking ub as an assert is not a good idea as the compiler can remove it in release mode
you could invoke the equivalent of __debugbreak and then exit or abort
I think this assert function should be fine?
It would have prevented the issue where the function was not being called, as the function param is a bool, and the compiler won't remove UB in release mode (at least I think so?)
Plus, I'll just use __debugbreak as it's available in MSVC
this looks cursed
just include <cassert>
those snippet above look like some gpt output or some copy pasta from some cpilled kid which does c++ but doesnt want to do c++ on stack overflow
__debugbreak works but is also not portable, but i guess you still just target msvc/d3d12?
it's easy enough to port to other environments
I think this function needs to abort on fail too
well the intention was to not rely on #defs that may or may not be present, but that shouldn't be an issue realistically
I'll switch to using normal assert itself
Yeah
yeah, the other stuff is more weird : )
i use someone's portable debug_break because im not targeting c++26 yet
Or perhaps throw an exception
normal assert ostensibly isn't viable
It prints out
Assertion failed: 0 && "This?", file C:\Dev\voxel-engine\src\main.cpp, line 20
and this message box
yeah that one
I've never done actual error handling / exception stuff before. I will start working on this once I get that fancy single heap memory idea done, because after that I plan on restructuring the code properly
I think I found the reason execute indirect was causing nsight and renderdoc to crash. I have 2 different definitions of indirect command, one in HLSL and one in C++, and they are causing a LOT of problems for me because of implicit alignment and padding.
On C++ side:
```cpp
struct indirect_command_t
{
    interop::voxel_render_resources_t render_resources{};
    D3D12_INDEX_BUFFER_VIEW index_buffer_view{}; // alignment of 8
    D3D12_DRAW_INDEXED_ARGUMENTS draw_arguments{};
};
static_assert(sizeof(interop::gpu_indirect_command_t) == sizeof(indirect_command_t));
```
On HLSL side:
```cpp
struct gpu_indirect_command_t
{
    voxel_render_resources_t voxel_render_resources;
    uint padding;
    uint4 index_buffer_view;
    uint4 draw_arguments_1;
    uint draw_arguments_2;
    uint padding2;
};
```
I fixed it by doing a #pragma pack(push, 4) on the C++ side. Plus, I have a static assert so the size of cpu/gpu indirect command is same, hopefully mitigating such issues in future
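A self-contained illustration of the implicit padding and the effect of `#pragma pack(push, 4)`, assuming a typical 64-bit target where `std::uint64_t` has 8-byte alignment. The two structs are simplified stand-ins with made-up field names, not the real D3D12 types:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Default layout: the 8-byte-aligned member forces 4 bytes of hidden padding
// after resource_index, and the struct size is rounded up to a multiple of 8.
struct DefaultLayout
{
    std::uint32_t resource_index; // offset 0
    std::uint64_t gpu_address;    // offset 8 (4 bytes of implicit padding before it)
    std::uint32_t draw_args[5];   // offset 16, 20 bytes
};

// Packed to 4-byte alignment: no hidden padding, matching a GPU-side struct made of uints.
#pragma pack(push, 4)
struct PackedLayout
{
    std::uint32_t resource_index; // offset 0
    std::uint64_t gpu_address;    // offset 4
    std::uint32_t draw_args[5];   // offset 12, 20 bytes
};
#pragma pack(pop)

static_assert(offsetof(DefaultLayout, gpu_address) == 8, "implicit padding before the 8-byte member");
static_assert(sizeof(DefaultLayout) == 40, "36 bytes of data and padding, rounded up to 40");
static_assert(offsetof(PackedLayout, gpu_address) == 4, "no padding when packed to 4");
static_assert(sizeof(PackedLayout) == 32, "4 + 8 + 20 bytes");
```

Pinning the layout with `static_assert` on both size and offsets catches a CPU/GPU mismatch at compile time rather than as a crash inside a capture tool.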
Some metrics for reference (current application frame time when thousands of chunks are being loaded at once) :
Uhh so according to nsight systems only one CPU core is active? This doesn't look right to me at all
I can see 12 different thread IDs, so now I'm really confused as to why nsight says something else
A screenshot from my thread pool code and console output:
sorry for the bajillion screenshots, but here's the culprit: each frame is filled with CreateCommittedResource calls, so the app is heavily CPU bottlenecked
Making slow progress
Each chunk just has 'offsets' into the color / index buffer data in the actual (very large) committed resources that the chunk manager holds
This means RAM / VRAM usage is constant. Still have a lot of issues (I'm doing memcpy / CopyResource for the entire buffer each frame) + there are some index related bugs, which I am working on right now
I tried having the large buffers be upload buffers (to avoid multiple copy resources per frame), but something looks fishy
I am not sure if this means it takes 30ms between command list close and reset on the CPU side? That doesn't look right to me
Time to get tracy implemented, something is really wrong on the CPU side (and I've wanted to try out tracy for a long time, so it should be a good thing to setup)
Oh hell nah. 32ms to fill a buffer on the CPU side
Quickest optimization fix ever 
Now when new chunks are loaded, I directly update the upload buffers with the chunk indices data
Now rendering is at a smooth 60FPS even when bajillion chunks are loading
Still some offset related issue, need to figure out why that is happening
Still something wrong with the cpu side : memory usage shoots up wayy too rapidly.
Need to have a constant allocation like what I do on the GPU side
here is a screencap with latest optimization updates :)
```cpp
    ZoneScopedN("Chunk eviction");
    std::vector<voxel_chunk_position_t> keys_of_chunks_to_unload{};
    std::erase_if(chunk_manager.m_loaded_chunks,
                  [current_chunk_3d_index, &chunk_manager, &keys_of_chunks_to_unload](
                      const std::pair<const voxel_chunk_position_t, voxel_chunk_t> &key_value_pair) {
                      auto &chunk = key_value_pair.second;
                      if ((std::abs(chunk.m_chunk_position.x - current_chunk_3d_index.x) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2) ||
                          (std::abs(chunk.m_chunk_position.y - current_chunk_3d_index.y) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2) ||
                          (std::abs(chunk.m_chunk_position.z - current_chunk_3d_index.z) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2))
                      {
                          keys_of_chunks_to_unload.push_back(key_value_pair.first);
                          return true;
                      }
                      return false;
                  });
    if (!keys_of_chunks_to_unload.empty())
    {
        std::scoped_lock<std::mutex> scoped_lock(chunk_manager.m_unloaded_chunk_queue_mutex);
        for (const auto &key : keys_of_chunks_to_unload)
        {
            auto it = chunk_manager.m_loaded_chunks.find(key);
            if (it != chunk_manager.m_loaded_chunks.end()) // check added just for my sanity
            {
                chunk_manager.m_unloaded_voxel_chunks.emplace(std::move(it->second));
            }
        }
    }
}
```
This small piece of code is causing voxel_chunk_t constructor to be called, which internally allocates memory for a new chunk (heap allocated array of voxels).
Not sure how and where the constructor is being called, as it feels really strange
ok ik its just debug stuff but visually this looks really cool despite the compression noises. reminds me of early windows screensavers
This doesn't make sense 
Memory usage when I load chunks, don't move camera, and trivy is connected
Memory usage when trivy isn't connected for the same scenario (it never stops growing?)
what's trivy?
tracy?
tracy keeps a growing buffer of information to send to the client that it expects will be eventually connected
you can disable this via some define, but it will make any data prior to the client being connected not appear
my bad meant to say tracy, trivy is a different tool
No, its the opposite
process memory usage grows without tracy being connected
that's also what I'm describing
the client consumes the info but in this case there's no client connected so it just keeps accumulating in your app
Ahh I see what you mean now
I added the TRACY_ON_DEMAND macro, things work as expected now
Thought for a second that my code was really broken. It is, but not too much
Thanks for the help
I experienced this issue myself hehe
eventually tracked it down to having ZoneScoped; in a really hot function and then someone pointed this out to me
I see.. had a question for you (since you also work on voxel stuff)
How much Ram does your process use? in my engine, when I load around 2k chunks, VS reports 1.4-1.5 gb of memory usage. Is this too high for a voxel engine?
I don't compress my data at all (each chunk -> 8x8x8 voxels, each voxel is 1 bit, each chunk (on CPU) stores an array of indices as u16 and a float3 color), but if this is way too high I might have to start changing things
what does task manager say when you run the exe outside the debugger?
vs also has a button to launch without the debugger
vs says 3gb but task manager says 1.3 gb
Strange, VS says 1.4 but task manager reports 800mb
the extra stuff is probably debug structures that the debugger allocates
Perhaps, I always thought debugging had very little overhead, but I suppose it's a lot more complicated than that
idk much about it either, I'm just making an inference
ahhh stuck with some office shenanigans, so will probably get back to work on this engine sometime this week
Next plan : make all cpu side memory allocations static, followed by code that is well structured (minor refactor), error handling,etc
Was able to make all CPU side allocations up front.
However, now it's back to the strange performance issues: FPS drops to 30 when loading chunks, even though there are no memory allocations going on
Ah I see the issue now
Basically, I have a hashmap of key value pair of chunk position & index to a vector of preallocated chunks
In the renderloop, I iterate over the hashmap and render. However, in the async chunk meshing function, I have to access this hashmap. So, I added a mutex which causes the main loop to sometimes wait on the chunk meshing function
Need to figure out a solution to this -> either have a system where main loop has a high 'priority' and can unlock mutex at any point of time, or some other way where the main loop never has to wait on anything
Before I had a function to iterate over a queue of std::async's and load said chunks into the loaded chunk hashmap. Might have to do something similar this time too (even though the std::async doesn't actually return a value)
I removed all mutexes and now use a queue of std::future, but something is still seriously wrong
maybe my idea of having 1000's of chunks loaded at the same time is wrong, or something else
When new chunks are being loaded, the app temporarily drops a couple FPS, which should not happen
Now I am at a point where the main thread is itself becoming a bottleneck: the code that checks if a chunk has to be evicted or not takes tens of ms in the worst-case scenario
Btw you can create tracked mutexes in Tracy so you can see which threads are waiting on them
Sadly it makes the client crash on my machine for some reason (possibly I have too many threads-it only supports up to 64)
Well right now I have no mutexes anywhere except in the thread pool, but I will keep this point in mind
I guess I am not handling the hashmaps correctly or something (all voxel chunk related data is static, but maybe the chunk unloading / loading setup is messed up)
Wait I have dementia you literally just said you removed them lol
Apparently I am trying to unload 5 million chunks
There's like 6K chunks loaded, what on earth am I trying to unload
no wonder memory usage was so high despite having static mem allocs
Curious, is this a common programming paradigm or just bad code?
I am trying to not add items to my stacks / queue if they are already present, so I use an unordered set to get O(1) lookup times. Haven't seen this being used elsewhere so not sure how good this is
What's this eviction and setup stuff that's taking forever?
Thats a bug I fixed just now
Basically before I didn't have an unloaded chunk set, so the unloaded chunk queue variable grew to millions of elements (it never stopped growing as it was adding a lot of duplicate elements)
Now that I have added this queue + set logic, chunk eviction went right down to 0.5ms
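The queue + set pairing can be sketched like this, assuming chunks are identified by a u64 key (`UniqueQueue` is a hypothetical name, not the engine's type). The set answers "is this already queued?" in O(1) on average, and the queue preserves the processing order:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <unordered_set>

// FIFO queue that silently rejects keys that are already waiting in it.
class UniqueQueue
{
  public:
    // Returns false (and does nothing) if the key is already queued.
    bool push(std::uint64_t key)
    {
        if (!m_present.insert(key).second)
        {
            return false;
        }
        m_order.push(key);
        return true;
    }

    // Removes and returns the oldest key; it may be pushed again afterwards.
    std::uint64_t pop()
    {
        assert(!m_order.empty());
        const std::uint64_t key = m_order.front();
        m_order.pop();
        m_present.erase(key);
        return key;
    }

    bool empty() const
    {
        return m_order.empty();
    }

    std::size_t size() const
    {
        return m_order.size();
    }

  private:
    std::queue<std::uint64_t> m_order;
    std::unordered_set<std::uint64_t> m_present;
};
```

Keeping both containers behind one small class makes it harder for the set and the queue to drift out of sync, which is the main risk of this pattern.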
no, each frame
ah, it needs to be faster then
That function is called 5k times, is that even a large number for the CPU?
is this release mode
Yeah, release with debug info
Let me check if just release mode changes anything
It depends
But I wonder if you could use a different data structure that doesn't force you to iterate over every chunk
Like you could use a 2D grid with modular indexing
In release mode I observed 750us but for 24K iterations
Actually this is basically the same as a problem we had when implementing virtual shadow maps
Which is how to deallocate edge pages when the light frustum moves
Actually I didn't deallocate anything in the end so nvm
Yeah definitely, I made this CUDA flocking boid sim a while back where I used a spatial grid, got a massive perf improvement using it
the "problem" is that you'd have to move every chunk around the grid whenever the player moved, which might be a little costly, though probably not too bad
I don't move anything
I suppose I should also mention that you should only do any of this when the player moves into another chunk
I mean with my idea, sry
Which is to literally have a grid representing chunks around the player and move the pointers around when the player moves
Current engine status:
Horrible compression artifacts but at least now I can fly around rapidly (each colored cube is an entire chunk) and not run out of memory
If the pointer falls off the grid, then delete the chunk (you can define a smaller area within the grid to allocate chunks so moving back and forth a small distance can't cause a chunk to repeatedly appear and disappear)
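One way to read the "grid with modular indexing" suggestion, as a rough sketch only: chunk coordinates wrap into a fixed-size 2D array of slots, so nothing has to be moved when the player moves, and a chunk is evicted only when a different chunk lands on its slot. The grid size and all names are assumptions made for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kGridEdge = 8; // hypothetical number of chunk slots per axis

// Wraps any (possibly negative) chunk coordinate into [0, kGridEdge).
inline int wrap(int v)
{
    return ((v % kGridEdge) + kGridEdge) % kGridEdge;
}

class ChunkGrid2D
{
  public:
    ChunkGrid2D() : m_slots(kGridEdge * kGridEdge)
    {
    }

    // Stores the chunk in its wrapped slot. Returns true if a different chunk was evicted.
    bool place(int chunk_x, int chunk_z)
    {
        Slot &slot = m_slots[slot_index(chunk_x, chunk_z)];
        const bool evicted = slot.occupied && (slot.chunk_x != chunk_x || slot.chunk_z != chunk_z);
        slot.occupied = true;
        slot.chunk_x = chunk_x;
        slot.chunk_z = chunk_z;
        return evicted;
    }

    bool contains(int chunk_x, int chunk_z) const
    {
        const Slot &slot = m_slots[slot_index(chunk_x, chunk_z)];
        return slot.occupied && slot.chunk_x == chunk_x && slot.chunk_z == chunk_z;
    }

  private:
    struct Slot
    {
        bool occupied = false;
        int chunk_x = 0;
        int chunk_z = 0;
    };

    static std::size_t slot_index(int chunk_x, int chunk_z)
    {
        const int index = wrap(chunk_z) * kGridEdge + wrap(chunk_x);
        assert(index >= 0 && index < kGridEdge * kGridEdge);
        return static_cast<std::size_t>(index);
    }

    std::vector<Slot> m_slots;
};
```

Lookup and eviction are both O(1) per chunk, so there is no per-frame pass over every loaded chunk; the trade-off is that the grid edge has to be at least as large as the render distance in chunks.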
This was what I was going for at first, but seemed fairly complex at that time
But at least now I know how to do memory management (by not doing any allocations)
As much as I want to improve the chunk eviction logic using a grid or something, I hate how the voxels look, so next up might be getting basic lighting, some procedural terrain gen, and then I'll come back to perf improvements
Hehe true
I'm honestly surprised how many perf wins I can get by just not being stupid
Before, I used to wonder what my bottleneck was and try random optimizations, but now that I've got tracy set up I know exactly what's wrong and what to fix first
So I decided to finally get good, so working on this project again
On a completely unrelated sidenote, I really don't like ImGui's new dx12 init api.
You basically have to pass an alloc function that can be used to alloc new descriptors. However, the type definition is
void (*SrvDescriptorAllocFn)(ImGui_ImplDX12_InitInfo* info, D3D12_CPU_DESCRIPTOR_HANDLE* out_cpu_desc_handle, D3D12_GPU_DESCRIPTOR_HANDLE* out_gpu_desc_handle);
No place for a 'this' ptr, so can't really use a member function for this.
Create a lambda func and pass it in? Nope, I'd need a capture list and I don't think I can use that with the above function pointer, because a lambda can't decay to a function pointer if something is captured
you should raise an issue on gh if you think a user pointer would be helpful
alternatively, copy the backend into your source tree and fix it locally
Was trying to figure out if I was maybe missing something, but I think a user pointer could be useful here
I did that with the vulkan backend and ended up changing quite a few things
I used a global variable to fix this issue (as there is a comment that indicates that for now only a single descriptor would be used, but it might change in the future)
the old api is still usable, but I switched to vcpkg and the package there hasn't been updated, because of which when I use the latest vcpkg package imgui fails to initialize
Lmao ImGui_ImplDX12_InitInfo already has a UserData void pointer, only noticed this while writing the issue
Of course ocornut thought about something like this, massive skill issue for me
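The general pattern behind a user-data pointer, sketched with deliberately fake types: `InitInfo` and `DescriptorHeap` below are hypothetical stand-ins, not ImGui's real structs or signatures. A captureless lambda converts to a plain function pointer, and the object it needs is recovered from the `void*`:

```cpp
#include <cassert>

// Hypothetical C-style init struct with a callback and a user-data slot.
struct InitInfo
{
    void *user_data = nullptr;
    void (*alloc_fn)(InitInfo *info, int *out_handle) = nullptr;
};

// Hypothetical allocator object that the callback needs to reach.
class DescriptorHeap
{
  public:
    int allocate()
    {
        return m_next++;
    }

  private:
    int m_next = 0;
};

// Wires a member-function call through the function pointer without any captures.
inline InitInfo make_init_info(DescriptorHeap &heap)
{
    InitInfo info;
    info.user_data = &heap;
    // No capture list, so this lambda decays to a plain function pointer.
    info.alloc_fn = [](InitInfo *i, int *out_handle) {
        assert(i->user_data != nullptr);
        auto *target = static_cast<DescriptorHeap *>(i->user_data);
        *out_handle = target->allocate();
    };
    return info;
}
```

The heap has to outlive the `InitInfo`, since only a raw pointer is stored; that lifetime is the caller's responsibility in this kind of API.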
quickly put together some procedural generation stuff (using fast noise lite)
but right now it's running at 30fps even when no chunks are loading. Without proc gen it was 144FPS, will do some profiling tomorrow to see what went wrong
The issue was I was doing meshing and generation for each voxel one by one, so chunking was not really working
Also have to put this project on hold cuz I might go to jail || currently awaiting approval from employer to work on this project as its technically open source. Worst case I make the repo private and continue working on it ||
|| this got denied, so working on the project in a private repo ||
Finally some more progress, I am now using ReBAR, so switched to using GPU_UPLOAD heaps in many places. Need to check if this has any major perf impact
This is a chunk apparently
experimenting with using 1 dispatchmesh per chunk, not really possible due to limitation that num.vertices and primitives <= 256
but still what is going on with this chunk
getting somewhere
I'm basically using an amplification shader to do the meshing work (for a 4x4x2 chunk), and the individual triangles are processed by a simple mesh shader || (with thread block size as 1x1x1) ||
Will fix it once this AS + MS simple case is done
Tis done
1 amplification shader per chunk, and mesh shaders process triangles & vertices (32 'meshlets' processed in a thread group this time, where each meshlet = mesh for a single cube face, i.e. 2 triangles)
The amplification shader does all the meshing (nothing fancy, just a "don't render a face if 2 voxels are next to each other" kind of thing), so the CPU right now just needs to fill a uint32 where each bit tells if a voxel in the 4x4x2 chunk is active or not
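A CPU-side sketch of packing a 4x4x2 chunk into one uint32. The bit order (x fastest, then y, then z) is an assumption, since the actual layout used by the shaders isn't shown, and the function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

constexpr int kSizeX = 4;
constexpr int kSizeY = 4;
constexpr int kSizeZ = 2; // 4 * 4 * 2 = 32 voxels, one bit each

// Assumed bit layout: x varies fastest, then y, then z.
inline int bit_index(int x, int y, int z)
{
    assert(x >= 0 && x < kSizeX && y >= 0 && y < kSizeY && z >= 0 && z < kSizeZ);
    return x + kSizeX * (y + kSizeY * z);
}

// Returns the mask with the voxel at (x, y, z) marked active.
inline std::uint32_t set_voxel(std::uint32_t mask, int x, int y, int z)
{
    return mask | (std::uint32_t{1} << bit_index(x, y, z));
}

// Out-of-range coordinates count as empty, which is convenient for neighbour checks.
inline bool voxel_active(std::uint32_t mask, int x, int y, int z)
{
    if (x < 0 || x >= kSizeX || y < 0 || y >= kSizeY || z < 0 || z >= kSizeZ)
    {
        return false;
    }
    return ((mask >> bit_index(x, y, z)) & 1u) != 0u;
}
```

The same indexing would have to be mirrored on the GPU side so that both agree on which bit belongs to which voxel.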
no more R16 index buffers on CPU, so memory usage is heavily reduced
Plus, the mesh shader does backface culling so perf is a little bit better
Next step is to learn DXR as I'm sick of these ugly ahh voxels
DXR is definitely harder than I thought || and I'm not able to get time to work on this project, not an excuse ||
Right now I'm using ray pipelines (and not ray query), and got the bare minimum down: shadows, reflections, and transparency
I'll probably do a bit more RT in the meanwhile, and eventually (if I don't get skill issues) try custom intersection shaders that I can use for voxel rendering