voxel-engine, a creatively named Voxel engine | Graphics Programming | Page 1

fallen ocean May 15, 2024, 3:13 PM

#

A small C++/D3D12 voxel engine made for learning about GPUs, optimizations, and voxels

#

Feels very strange posting about this tiny learning project in front of the massive projects here, but I decided to get back into GP by making a voxel engine using the tools I am oki-sh with, C++ and D3D12.

Some principles I want to follow during this project:
(i) Almost everyday updates (accountability for myself)
(ii) No over engineering, no unnecessary abstractions
(iii) Fast, optimized and clean code
(iv) Making something other than the generic sponza renderer
(v) Learn what 99.9% of my code actually does (i.e when I link a library what happens, what happens when I do X, and Y, etc....)

#

Current status as of day 1 : Going over the basics on voxels from https://sites.google.com/site/letsmakeavoxelengine/home/, in my almost single file engine (which can be found here : https://github.com/rtarun9/voxel-engine/tree/main)

fallen ocean May 16, 2024, 3:17 PM

#

[Day 2]

Today I focused mostly on a simple renderer abstraction. Nothing crazy, only abstraction it really does now is buffer creation.

Also made some small changes to the main rendering code, and now have multiple chunks being rendered.
The app load time is really slow (chunks being created I suppose), so that is something I have to focus on soon

#

fallen ocean May 17, 2024, 5:55 PM

#

[Day 3]
So today I tried making a simple chunk loading system, where if the camera enters a new 'chunk area / zone', a new chunk is loaded. I'm re-using the same chunk data over and over, just want to get the math done right.

Unfortunately I don't have any fancy screenshots as I couldn't get it working 100% right, but I'm pretty sure by tomorrow I will have this basic chunk loading system done

fallen ocean May 18, 2024, 4:07 AM

#

[Day 4]
Finally found a reason to use line list, now I have a debug visualization of the various chunks in the scene. Hopefully this will make it easier to visualize the chunks when working on the loading system

fallen ocean May 18, 2024, 12:13 PM

#

I implemented this very basic naive chunk loading algorithm, not very efficient but it gives me a baseline to implement more things

I've switched stuff around in my project so that a Chunk is just a u64 chunk_index. The actual cubes (that each voxel represents) are moved over to a hashmap of {chunk_index, Cube*} so that when I move chunks between the different vectors (loaded, unloaded, and chunks_to_load), I don't have to worry much about move / copy constructors, destructors being called, and such.

Next step is going to be learning how to procedurally generate chunks (rather than just give them all a random color).
Once that is done, I will come back to the existing code and try optimizing it a bit.

#

That gray background is my debug grid draw, should probably disable it when posting future videos

fallen ocean May 20, 2024, 3:05 PM

#

[Day 5 and 6]
There are some small miscalculations in how I figure out which chunk the user is currently in (I was not facing this before because voxel size was fixed to 1). My end goal is to have tiny voxels, but figuring out this math issue should give me a better understanding of the voxel / chunk coordinate system

fallen ocean May 21, 2024, 1:16 AM

#

[Day 7]
First, I've fixed yesterday's calculations. I was dividing the camera position by the wrong constant, so the "current chunk index" computation was wrong.
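A minimal sketch of that kind of "current chunk" computation, assuming a cubic chunk with a fixed voxel count per axis and a uniform voxel size. The names `VOXELS_PER_CHUNK_AXIS`, `chunk_coord_t` and `chunk_from_position` are made up for illustration and are not from the repo. The divisor is the world-space edge length of a chunk (voxels per axis times voxel size), and `std::floor` is used so negative positions land in negative chunks.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical constant, not taken from the engine.
constexpr std::int32_t VOXELS_PER_CHUNK_AXIS = 16;

struct chunk_coord_t
{
    std::int32_t x, y, z;
};

// World-space length of one chunk edge for a given voxel size.
inline float chunk_edge_length(float voxel_size)
{
    return static_cast<float>(VOXELS_PER_CHUNK_AXIS) * voxel_size;
}

// Maps a world-space position to the 3D index of the chunk that contains it.
// floor (not truncation) keeps positions just below zero in chunk -1 instead of chunk 0.
inline chunk_coord_t chunk_from_position(float px, float py, float pz, float voxel_size)
{
    const float edge = chunk_edge_length(voxel_size);
    return {
        static_cast<std::int32_t>(std::floor(px / edge)),
        static_cast<std::int32_t>(std::floor(py / edge)),
        static_cast<std::int32_t>(std::floor(pz / edge)),
    };
}
```

With a voxel size of 1 the divisor happens to equal the voxel count, which may be why a wrong constant goes unnoticed until the voxel size changes.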

I've changed the code a bit so that even if X number of chunks are to be created, only Y are created each frame (need to figure out how chunks next to the player can be loaded first).

However, now I'm getting a lot of issues with ComPtr (InternalRelease). I think this is because of some copy constructor related activities. I will refresh my memory on the rule of 5 today, and try using moves as much as possible (and try fixing this InternalRelease issue that randomly occurs sometimes during the assignment operator here):

fallen ocean May 22, 2024, 4:05 PM

#

[Day 8]
A short demo of the new and improved chunk loading system (still very naive). The demo renders only 1 voxel per chunk (render distance / how far chunks are loaded from player = 9)

#

The chunks near the player are loaded first, and then the ones far away (until chunk_render_distance). I also decided to use a stack instead of a queue to determine which chunks to create first, because if a player goes very quickly to a new region, those new chunks must be created and rendered first
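A rough sketch of how a stack can give both "nearest first" and "newest region first", under the assumption that each region is pushed farthest-first so the nearest chunk ends up on top. `coord_t`, `push_region` and `pop_batch` are hypothetical names, and a real loader would also skip chunks that are already loaded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct coord_t
{
    int x, y, z;
};

inline int dist_sq(const coord_t &a, const coord_t &b)
{
    const int dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Pushes every chunk coordinate within `radius` of `center` onto the stack,
// farthest first, so the chunk closest to the player is popped first.
inline void push_region(std::vector<coord_t> &stack, const coord_t &center, int radius)
{
    std::vector<coord_t> region;
    for (int z = -radius; z <= radius; ++z)
        for (int y = -radius; y <= radius; ++y)
            for (int x = -radius; x <= radius; ++x)
                region.push_back({center.x + x, center.y + y, center.z + z});

    std::sort(region.begin(), region.end(), [&center](const coord_t &a, const coord_t &b) {
        return dist_sq(a, center) > dist_sq(b, center);
    });
    stack.insert(stack.end(), region.begin(), region.end());
}

// Pops at most `budget` coordinates: the "only Y chunks per frame" limit.
inline std::vector<coord_t> pop_batch(std::vector<coord_t> &stack, std::size_t budget)
{
    std::vector<coord_t> batch;
    while (!stack.empty() && batch.size() < budget)
    {
        batch.push_back(stack.back());
        stack.pop_back();
    }
    return batch;
}
```

Because the most recently pushed region sits on top, a player who jumps to a new area gets those chunks before the leftovers of the old area.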

fallen ocean May 24, 2024, 5:29 PM

#

A small optimization : If any face of a "voxel cube" is covered by another voxel cube, create the vertex buffer in such a way that that face's triangles are not rendered.
This simple optimization makes moving through the world "not be very laggy"
In this photo, only 1 in 5 voxel 'cubes' are set active, in order to not have the image be a large wireframe mess.
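A small sketch of this per-face check, assuming a dense boolean grid per chunk and treating anything outside the chunk as empty (so faces on the chunk border are always kept). `CHUNK_EDGE`, `chunk_t` and `count_visible_faces` are illustrative names only. It counts the faces a mesher would emit instead of building an actual vertex buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr int CHUNK_EDGE = 4; // hypothetical size, kept small for the example

struct chunk_t
{
    std::array<bool, CHUNK_EDGE * CHUNK_EDGE * CHUNK_EDGE> active{}; // all voxels start inactive
};

inline std::size_t voxel_index(int x, int y, int z)
{
    return static_cast<std::size_t>(x + CHUNK_EDGE * (y + CHUNK_EDGE * z));
}

// Voxels outside the chunk are treated as empty (assumption: no neighbour chunk lookup).
inline bool is_active(const chunk_t &chunk, int x, int y, int z)
{
    const bool inside = x >= 0 && y >= 0 && z >= 0 &&
                        x < CHUNK_EDGE && y < CHUNK_EDGE && z < CHUNK_EDGE;
    return inside && chunk.active[voxel_index(x, y, z)];
}

// Counts the faces that would be emitted: a face is kept only when the
// neighbouring cell in that direction is empty.
inline std::size_t count_visible_faces(const chunk_t &chunk)
{
    static const int neighbours[6][3] = {
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};

    std::size_t faces = 0;
    for (int z = 0; z < CHUNK_EDGE; ++z)
        for (int y = 0; y < CHUNK_EDGE; ++y)
            for (int x = 0; x < CHUNK_EDGE; ++x)
            {
                if (!is_active(chunk, x, y, z))
                    continue;
                for (const auto &d : neighbours)
                    if (!is_active(chunk, x + d[0], y + d[1], z + d[2]))
                        ++faces; // a real mesher would append two triangles here
            }
    return faces;
}
```

For a completely filled 4x4x4 chunk this keeps only the 96 outer faces out of 384.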

fallen ocean May 24, 2024, 5:50 PM

#

A video showing current progress. Next step -> Move the chunks that are loaded but covered by all sides by other chunks into unloaded chunks array.
(Unloaded chunks are simply not rendered, memory is still allocated which sometimes causes program to crash. This will be solved when some serialization logic is implemented)

#

what a crisp video thanks to compression froge_love

fallen ocean May 25, 2024, 4:44 AM

#

Started working on procedural world gen, but discovered a few bugs along the way (looks like Z fighting, chunk loading is a bit strange at random times)

fallen ocean Jun 2, 2024, 1:11 PM

#

Long time since I've posted here, but I'm currently working on CPU frustum culling. Have most of the code done, just have to fix some calculation issues, after which I'll post a screenshot here

fallen ocean Jun 9, 2024, 11:24 AM

#

Followed this blog : https://fgiesen.wordpress.com/2012/08/31/frustum-planes-from-the-projection-matrix/ to implement cpu frustum culling by extracting data from the projection matrix. Not 100% working as I'm facing lots of skill issues, will have to go over it again

However, even with whatever partial culling is happening, I can finally traverse a large scene without much issue. Loading chunks is causing a lot of lag, have to figure that out
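A minimal sketch of the plane extraction idea from the linked post, under some loudly stated assumptions: column vectors (clip = M * v), a left-handed D3D-style projection with depth in [0, 1], and hand-rolled `vec4_t` / `mat4_t` types. None of these names come from the engine, and a row-vector setup would use the matrix columns instead of the rows. Using only the projection matrix gives planes in view space; passing a combined view-projection matrix would give world-space planes.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct vec4_t
{
    float x, y, z, w;
};
using mat4_t = std::array<vec4_t, 4>; // four rows; assumed convention: clip = M * v

inline vec4_t add4(const vec4_t &a, const vec4_t &b) { return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}; }
inline vec4_t sub4(const vec4_t &a, const vec4_t &b) { return {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w}; }

// Left-handed perspective matrix with clip-space depth in [0, w] (D3D style).
inline mat4_t make_perspective(float fov_y, float aspect, float z_near, float z_far)
{
    const float sy = 1.0f / std::tan(fov_y * 0.5f);
    const float sx = sy / aspect;
    const float a = z_far / (z_far - z_near);
    mat4_t m{};
    m[0] = {sx, 0.0f, 0.0f, 0.0f};
    m[1] = {0.0f, sy, 0.0f, 0.0f};
    m[2] = {0.0f, 0.0f, a, -z_near * a};
    m[3] = {0.0f, 0.0f, 1.0f, 0.0f};
    return m;
}

// Planes as (a, b, c, d); a point p is inside when a*p.x + b*p.y + c*p.z + d >= 0.
inline std::array<vec4_t, 6> extract_frustum_planes(const mat4_t &m)
{
    std::array<vec4_t, 6> planes{};
    planes[0] = add4(m[3], m[0]); // left:   w + x >= 0
    planes[1] = sub4(m[3], m[0]); // right:  w - x >= 0
    planes[2] = add4(m[3], m[1]); // bottom: w + y >= 0
    planes[3] = sub4(m[3], m[1]); // top:    w - y >= 0
    planes[4] = m[2];             // near:   z >= 0 (depth range starts at 0)
    planes[5] = sub4(m[3], m[2]); // far:    w - z >= 0
    return planes;
}

struct aabb_t
{
    float min_x, min_y, min_z, max_x, max_y, max_z;
};

// Conservative test: the box is culled only if it lies fully behind some plane.
inline bool is_box_visible(const std::array<vec4_t, 6> &planes, const aabb_t &box)
{
    for (const vec4_t &p : planes)
    {
        // Pick the box corner that lies farthest along the plane normal.
        const float x = p.x >= 0.0f ? box.max_x : box.min_x;
        const float y = p.y >= 0.0f ? box.max_y : box.min_y;
        const float z = p.z >= 0.0f ? box.max_z : box.min_z;
        if (p.x * x + p.y * y + p.z * z + p.w < 0.0f)
            return false;
    }
    return true;
}
```

The planes are left unnormalized, which is fine for this sign-only box test but not for sphere tests that compare against a radius.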

#

This project is uncovering a lot of skill issues I had, will have to go through the entire code base and try finding out why it's so slow

fallen ocean Jun 18, 2024, 3:25 PM

#

About 25% done with the project re-write. Last time I made things too unnecessarily complex, instead of trying various approaches (to say voxel storage and rendering).

Trying to fix the issues with the project this time. Currently got imgui, bindless rendering in the project, will next try to work on a greedy meshing algorithm

meager marlin Jun 18, 2024, 3:38 PM

#

imo you should start with "naive" meshing

#

it's simple and doesn't have the annoying edge cases (T-junctions, when to split quads, UVs) like greedy meshing

#

if you need to reduce the number of vertices after that, then you can look at greedy meshing

fallen ocean Jun 18, 2024, 3:49 PM

#

meager marlin imo you should start with "naive" meshing

I have done 'basic meshing' (I think it's called), basically if a voxel is covered by other voxels, that voxel has no vertices to draw

meager marlin Jun 18, 2024, 3:49 PM

#

ah

fallen ocean Jun 18, 2024, 3:49 PM

#

fallen ocean A small optimization : If any of the faces of "voxel cube" is being covered by v...

It's in this post here

meager marlin Jun 18, 2024, 3:50 PM

#

I can't tell if it's applied per face

fallen ocean Jun 18, 2024, 3:50 PM

#

It is

#

I suppose that naive meshing was good enough, but the project was still really laggy and very messy
As you suggested I will re-implement naive meshing and then make the code more straightforward and simple, get frustum culling, multithreaded voxel loading and see how well it performs

#

If it is still slow then, I will go for greedy meshing and stuff

meager marlin Jun 18, 2024, 3:58 PM

#

I'd first suggest profiling to see why it was laggy

#

Multithreading the creation of chunk meshes means that your meshing algorithm can be pathetically slow and still not cause noticeable hitches for the user

#

It's kinda hard to do without shooting yourself in the feet though

fallen ocean Jun 18, 2024, 4:01 PM

#

meager marlin I'd first suggest profiling to see why it was laggy

Tbh if I stayed at the same place and rotated the camera, the FPS was fine (so rendering wasn't an issue)
The problem was loading / unloading chunks each frame, and creation of chunks was also slow. I was using very small chunks (16x16x16 where each voxel has a length of 1, which is maybe why it was so performance heavy?)

meager marlin Jun 18, 2024, 4:02 PM

#

I think it'll stutter a lot with no multithreading, no matter the algorithm

#

Unless you do like 1 chunk per frame

fallen ocean Jun 18, 2024, 4:02 PM

#

That (1 chunk per frame to load) didn't stutter but obviously loading chunks around the player took 2-3 business days, so have to get some simple multithreading done

#

I've used async / future before and found it very nice to use, will probably try that out

meager marlin Jun 18, 2024, 4:05 PM

#

hm

#

might be a nice excuse to try out c++20 coroutines

#

you can have a thread pool of coroutine schedulers (you'd need two libraries for this)

#

or you can shrimply use a thread pool and some concurrent queues

fallen ocean Jun 18, 2024, 4:22 PM

#

meager marlin might be a nice excuse to try out c++20 coroutines

Forgot that feature existed
Will try this out as it sounds like a good learning experience

fallen ocean Jun 20, 2024, 8:39 AM

#

So far I've re-implemented naive meshing. I'm in the middle of making a multithreaded chunk loading system (using async/future first, and then once that works moving on to a coroutine based thread pool).
Also, I just came across this : D3D12_HEAP_TYPE_GPU_UPLOAD and honestly think it can make my life much easier, so going to try that out

#

Came across that new heap type in this article : https://gpuopen.com/learn/using-d3d12-heap-type-gpu-upload/

fallen ocean Jun 20, 2024, 10:48 AM

#

My GPU doesn't support it 😦 so back to using upload heaps

fallen ocean Jun 21, 2024, 5:56 AM

#

Async copy with a multi-threaded chunk loading system (using std::future/async) to load chunks with no lag! (using 2 different command queues, one for copy and one for direct)
Note that each cube in the video is a chunk with very small voxels. I'm using a sort of hacky mutex / condition_variable approach so have to review the code, but after I'm sure the program is correct, will work on a cpp20 - co-routine based thread pool

#

When I move the camera there are some slight artifacts (feels exaggerated with game capture), but I will try to fix that up soon

fallen ocean Jun 22, 2024, 10:08 AM

#

    for (size_t k = 0; k < chunk_manager.m_loaded_chunks.size(); k++)
    {
        const size_t i = chunk_manager.m_loaded_chunks[k].m_chunk_index;

        const VoxelRenderResources render_resources = {
            .position_buffer_index = static_cast<u32>(chunk_manager.m_chunk_position_buffers[i].srv_index),
            .color_buffer_index = static_cast<u32>(chunk_manager.m_chunk_color_buffers[i].srv_index),
            .chunk_index = static_cast<u32>(i),
            .scene_constant_buffer_index = static_cast<u32>(scene_buffer.cbv_index),
        };

        command_list->SetGraphicsRoot32BitConstants(0u, 64u, &render_resources, 0u);
        command_list->DrawInstanced((u32)chunk_manager.m_chunk_number_of_vertices[i], 1u, 0u, 0u);
    }

#

This has to be the most hidden bug I have ever faced 🫠 took me a while to find what it was, but once you see it you can't unsee it
In the image above, I load chunks from indices 1 -> N. The above demo hit a use case for which the above loop works fine

meager marlin Jun 22, 2024, 10:10 AM

#

I'm guessing you put k where an i should've been or vice versa

#

those variable names are footgunny

fallen ocean Jun 22, 2024, 10:11 AM

#

meager marlin I'm guessing you put `k` where an `i` should've been or vice versa

No, loaded chunks is a hashmap and not a vector (even I forgot about this tbh)
Say the current (and only) chunk loaded has chunk index 100
First k = 0, and because of m_loaded_chunks[0], a new chunk at index 0 is created and size of loaded chunks becomes 1
Then when k = 1, size of loaded chunks array becomes 2, a new chunk is created for loaded_chunks[1] and so on

So basically chunks were being created rapidly and the chunk index was always 0, and I couldn't even see the new chunks being loaded since they were all at the SAME position

#

So when I wanted to render a single chunk, I would render a million chunks but they all had the same position, which really makes the program lag

#

I switched to a range based for loop for the hash map (loaded_chunks), and now the program does not run at 15 fps frog_dum
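The behaviour described above can be reproduced in isolation: `operator[]` on `std::unordered_map` inserts a default-constructed value when the key is missing, so an index loop that compares against `size()` keeps growing the map it iterates. A small sketch, with a stand-in `chunk_t` and a `std::size_t` key as assumptions (the engine's actual key and value types are not shown in the thread):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct chunk_t
{
    std::uint64_t m_chunk_index = 0; // default-inserted chunks all get index 0
};

using chunk_map_t = std::unordered_map<std::size_t, chunk_t>;

// Mirrors the buggy loop: every missing key k is silently inserted,
// which also bumps size() and keeps the loop running.
inline std::size_t draws_with_index_loop(chunk_map_t &loaded)
{
    std::size_t draws = 0;
    for (std::size_t k = 0; k < loaded.size(); k++)
    {
        (void)loaded[k].m_chunk_index; // inserts key k if it is not present
        ++draws;
    }
    return draws;
}

// The fix from the thread: iterate over the entries that actually exist.
inline std::size_t draws_with_range_for(const chunk_map_t &loaded)
{
    std::size_t draws = 0;
    for (const auto &entry : loaded)
    {
        (void)entry;
        ++draws;
    }
    return draws;
}
```

With a single chunk stored under key 100, the index loop only stops once `k` catches up with the largest key, issuing 101 draws and leaving 101 entries in the map; larger keys mean proportionally more phantom chunks.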

fallen ocean Jun 24, 2024, 3:27 PM

#

Not entirely sure how my program fails at rendering cubes, will have to figure this out soon
I really hope it's not some error in the copy queue implementation

#

The bottom right blue cube in renderdoc's VS output is clearly wrong, so some calculation is being messed up

fallen ocean Jun 27, 2024, 12:40 AM

#

I implemented reverse z, and all the above artifacts are gone 🙂

#

I think what happened is that, with so much precision given to objects very close to the near plane, the z values after the projection transform vary a lot once you move even a little away from the camera, even across the same flat plane

#

I did a check in RenderDoc and, for the above image, the depth buffer literally had values in the range [0.9999, 1]. With reverse z, I made minor changes to the code:
(i) Swapped the near and far planes when providing input to the projection matrix function
(ii) Changed the clear value of the depth buffer to 0.0 instead of 1.0
(iii) Changed the comparison func from less-than to greater-than

In RenderDoc, for a case like the one above, depth values now range from 0.4 to ~0.1. The precision is no longer all taken by objects very close to the near plane, and visual artifacts are vastly reduced
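To see why the standard depth values bunch up near 1, here is a small numeric sketch of the depth a D3D-style perspective projection produces, and of the reversed variant obtained by swapping near and far as in step (i). The function names and the near/far values used in the note below are made up for illustration. The usual explanation for the improvement is that reversed depth pairs well with a floating-point depth format, which is an assumption about the setup here.

```cpp
#include <cassert>
#include <cmath>

// Depth in [0, 1] produced by a D3D-style perspective projection for a
// point at view-space distance z (near_plane <= z <= far_plane).
inline double projected_depth(double z, double near_plane, double far_plane)
{
    return (far_plane / (far_plane - near_plane)) * (1.0 - near_plane / z);
}

// Reverse Z: the same formula with the two planes swapped, so the near
// plane maps to 1 and the far plane maps to 0.
inline double reversed_depth(double z, double near_plane, double far_plane)
{
    return projected_depth(z, far_plane, near_plane);
}
```

With near = 0.1 and far = 1000, a point only 10 units away already lands above 0.99 in the standard mapping, while the reversed mapping puts it close to 0, where floating-point values are densest. The depth clear value and comparison function then have to flip too, which is what steps (ii) and (iii) do.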

fallen ocean Jun 29, 2024, 5:40 AM

#

Did a small profiling for sanity check, but the app slow down after loading a lot of chunks is indeed due to very high VPC SOL. Next step is to get indirect rendering + GPU culling

fallen ocean Jun 29, 2024, 3:19 PM

#

I got (CPU clip space culling) and indirect rendering setup, but a major problem still exists (14+ms of .... nothing?)

meager marlin Jun 29, 2024, 7:11 PM

#

cpubound

fallen ocean Jun 30, 2024, 1:35 AM

#

Probably. I will check in the VS profiler to see what is messing with the FPS

fallen ocean Jun 30, 2024, 4:40 AM

#

Loaded up nsight systems and profiled the renderer, going to learn this tool, seems useful
I saw in this talk (https://www.gdcvault.com/play/1026202/Optimizing-DX12-DXR-GPU-Workloads) that over-subscribing GPU memory can be an issue, and by coincidence my renderer starts to lag when the number of chunks is very large (> 50K)

fallen ocean Jun 30, 2024, 3:27 PM

#

Very confusing, since now nsight systems reports that 6ms (out of 22) is because of CopyResource (for indirect rendering, command buffer for now is copied each frame from upload heap to default heap), and in the remaining 18ms... VPC has the highest throughput, but I already do culling?

#

Anyway, by tomorrow / Tuesday I will switch to doing culling on the GPU and avoid the use of CopyResource, and then see what has to be done for the VPC throughput

fallen ocean Jul 1, 2024, 3:26 PM

#

Implemented GPU compute culling, now the FPS has gone from 60 to 16 💀 safe to say something is going wrong

fallen ocean Jul 1, 2024, 4:06 PM

#

Seems like I forgot to clear the counter. Will read more on UAV atomics and hopefully get this done by tomorrow

fallen ocean Jul 4, 2024, 2:36 AM

#

For some reason my camera is all buggy (only for rotation). Going to spend today re-learning the math and figure out what's wrong
The only issue is : I've used the same camera class for 2 other renderers and things are working fine in those projects... Not exactly sure what's causing this issue

fallen ocean Jul 5, 2024, 2:46 AM

#

Issue fixed, had nothing to do with the camera code but actually with precision issues 😦
If each voxel has a small length (between 1 and 16), camera rotation is fine. In the video above I had it set to 640.0f, which I suppose caused the issue

fallen ocean Jul 6, 2024, 6:29 AM

#

As an optimization, I decided to remove position buffers (dedicated for each chunk) and rather have a single position buffer, but each chunk now has a dedicated index buffer

#

So basically rather than a float3 (12 bytes) per element, the index buffer has the same number of elements but each element is a u16 (2 bytes)

However, this required changing the index buffer WITH indirect rendering, due to which I cannot enable GPU based validation. Also, the program crashes if I use indirect rendering (with this index buffer) but not when doing normal rendering 🤔

fallen ocean Jul 6, 2024, 7:01 AM

#

I fixed the issue. Basically, on the CPU the indirect command had implicit padding because of alignment rules, and that padding was not reflected in the SRV and UAV byte stride. By adding a uint for padding (on both the CPU and GPU indirect command struct), the program no longer crashes

fallen ocean Jul 6, 2024, 8:27 AM

#

This is probably going to take a while to implement, but I am going to work on a circular ring buffer for chunks (both GPU and CPU)
Right now you fly around the world for a few minutes and allocate like 8GB memory, so instead I'll pre-allocate CPU and GPU mem for 10K chunks and re-use that memory for new chunks

#

I'm saying 10K chunks because my chunks are basically the size of a single minecraft voxel. Due to some precision issues I'm not able to make the voxels the same size as minecraft chunks. Maybe I should work on this first

fallen ocean Jul 6, 2024, 11:59 PM

#

fp precision issues are now vastly reduced... all I did was few minor things:
(i) camera position in view matrix is hardcoded to zero. In shader, when computing vertex position I subtract the camera position from the vertex position before view/projection transform
(ii) N buffers for view projection matrix (never had to use this, but made significant improvements)

#

[67K chunks (of 8x8x8 voxels) being rendered in the above video, good to know that the renderer can handle large chunk counts]

fallen ocean Jul 11, 2024, 2:58 AM

#

This doesn't look too good, but not sure if this is normal

#

RtlUserThreadStart uses ~94% of CPU. Maybe getting a thread pool integrated where I can reuse threads without creating a new one / restarting an old one will solve this issue

#

ntdll (which has thread related stuff inside) takes more CPU% than my actual engine, and in the engine code also all the threading related functions take a lot of CPU

fallen ocean Jul 12, 2024, 7:52 AM

#

fallen ocean I got (CPU clip space culling) and indirect rendering setup, but a major problem...

I quickly integrated https://github.com/bshoshany/thread-pool, and the unattributed context section is massively reduced

#

Never realized how expensive the overhead of a thread is

fallen ocean Jul 13, 2024, 3:49 AM

#

Making some more changes for this thread pool logic, as rn all threads share a single command list via mutexes
The end goal is for each chunk loading thread to have its own command list, and each thread has a queue of command allocators

fallen ocean Jul 13, 2024, 10:08 AM

#

How has the control flow entered that if loop what

fallen ocean Jul 21, 2024, 11:52 AM

#

I've decided to put this project on hold, and learn systems programming & basics of gp-gpu first before continuing on this project

No particular reason, but I think I still have a lot of missing fundamental knowledge, so I will try learning that before continuing on this project. At least I got to learn about using multiple queues (and got a small blog post on that topic) from this project

fallen ocean Jan 18, 2025, 11:01 AM

#

Totally forgot I had put some stuff over here (thanks to Deccer for the reminder 🙂)

I restarted the project, and implemented a custom thread pool in C++
Was a really simple thing to do, and learnt about packaged task, async / future
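For reference, a compact sketch of what a `std::packaged_task` based thread pool can look like. This is not the engine's implementation (which is not shown in the thread); `thread_pool_t` and `submit` are illustrative names. Each job is wrapped in a `packaged_task` so the caller gets a `std::future` back, and the workers are created once and reused.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class thread_pool_t
{
  public:
    explicit thread_pool_t(std::size_t thread_count)
    {
        for (std::size_t i = 0; i < thread_count; ++i)
            m_workers.emplace_back([this] { worker_loop(); });
    }

    // Lets the workers drain the remaining jobs, then joins them.
    ~thread_pool_t()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopping = true;
        }
        m_wake.notify_all();
        for (std::thread &worker : m_workers)
            worker.join();
    }

    thread_pool_t(const thread_pool_t &) = delete;
    thread_pool_t &operator=(const thread_pool_t &) = delete;

    // Queues a callable and returns a future for its result.
    template <typename F>
    auto submit(F &&job) -> std::future<decltype(job())>
    {
        using result_t = decltype(job());
        // shared_ptr because packaged_task is move-only and std::function needs a copyable callable.
        auto task = std::make_shared<std::packaged_task<result_t()>>(std::forward<F>(job));
        std::future<result_t> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push([task] { (*task)(); });
        }
        m_wake.notify_one();
        return result;
    }

  private:
    void worker_loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stopping || !m_jobs.empty(); });
                if (m_jobs.empty())
                    return; // only reachable once m_stopping is set
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job(); // run outside the lock so other workers can pick up jobs
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::queue<std::function<void()>> m_jobs;
    std::vector<std::thread> m_workers;
    bool m_stopping = false;
};
```

A chunk loader could call something like `pool.submit([coord] { return build_chunk(coord); })` and poll the returned futures each frame, where `build_chunk` is a hypothetical function.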

#

Now I'm trying to rewrite the code to make sense, and start fixing memory related issues (in previous builds, when new chunks are being created, the entire app slows down because of the massive number of ID3D12Resource objects being created)

#

So current plan is simple : create a 500mb heap, and new chunks use the same resource as the evicted chunk, but just with the chunk related values being updated. Will try to post more updates over here as I get stuff done
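The reuse part of that plan can be sketched without any D3D12 at all: carve one big allocation into equally sized slots and hand out byte offsets, returning a slot to the free list when its chunk is evicted. This is only a sketch under the assumption that every chunk needs the same amount of space; `chunk_slot_pool_t` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hands out byte offsets into one large, pre-allocated buffer.
// No memory is allocated or freed after construction.
class chunk_slot_pool_t
{
  public:
    chunk_slot_pool_t(std::size_t slot_count, std::size_t bytes_per_slot)
        : m_bytes_per_slot(bytes_per_slot)
    {
        m_free_slots.reserve(slot_count);
        // Push in reverse so slot 0 is handed out first.
        for (std::size_t i = slot_count; i > 0; --i)
            m_free_slots.push_back(i - 1);
    }

    // Writes the byte offset of a free slot to out_offset.
    // Returns false when every slot is in use (something must be evicted first).
    bool try_acquire(std::size_t &out_offset)
    {
        if (m_free_slots.empty())
            return false;
        out_offset = m_free_slots.back() * m_bytes_per_slot;
        m_free_slots.pop_back();
        return true;
    }

    // Makes the slot of an evicted chunk available to the next new chunk.
    void release(std::size_t byte_offset)
    {
        m_free_slots.push_back(byte_offset / m_bytes_per_slot);
    }

    std::size_t free_slot_count() const { return m_free_slots.size(); }

  private:
    std::size_t m_bytes_per_slot;
    std::vector<std::size_t> m_free_slots;
};
```

A new chunk would then only overwrite the bytes at its offset instead of creating a new resource.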

fallen ocean Jan 18, 2025, 1:22 PM

#

New bug alert : My camera doesn't move when the game is built in release mode
and nsight and renderdoc crash (even before frame is captured)

what kind of bugs are these froge_bleak

fallen ocean Jan 18, 2025, 3:22 PM

#

I found the keyboard issue, this is a bit embarrassing 🤦‍♂️

I had this assert #define

#

and GetKeyboardState was called inside an Assert, so in release mode the function was never being called at all
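The effect is easy to reproduce in isolation. In the sketch below, `RELEASE_STYLE_ASSERT` is a made-up stand-in for how an assert macro typically expands in a release build (the actual #define from the screenshot is not in the thread), and `poll_keyboard` stands in for the `GetKeyboardState` call.

```cpp
#include <cassert>

// Stand-in for a release-mode assert macro: the whole expression is
// discarded by the preprocessor, including any function call inside it.
#define RELEASE_STYLE_ASSERT(expr) ((void)0)

static int g_poll_count = 0;

// Stand-in for a call with a side effect that must always happen.
inline bool poll_keyboard()
{
    ++g_poll_count;
    return true;
}

// A plain function argument is always evaluated before the call.
inline void check(bool ok)
{
    (void)ok; // a real version would break / log / abort when ok is false
}

// Side effect hidden inside the macro: never runs in "release".
inline int polls_with_macro()
{
    g_poll_count = 0;
    RELEASE_STYLE_ASSERT(poll_keyboard());
    return g_poll_count;
}

// Side effect performed first, only the result is checked.
inline int polls_with_separate_call()
{
    g_poll_count = 0;
    const bool ok = poll_keyboard();
    check(ok);
    return g_poll_count;
}
```

Keeping the call on its own line and asserting only on the stored result avoids the problem regardless of how the assert is defined.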

limber smelt Jan 18, 2025, 6:02 PM

#

are you still writing C like a cave man?

#

or have you switched to c++ yet?

fallen ocean Jan 18, 2025, 11:11 PM

#

I used a custom assert function mostly because of this:

The definition of the macro assert depends on another macro, NDEBUG, which is not defined by the standard library.
But fair point, I'll see what is the more modern C++ alternative for this (inline function + constexpr maybe?)

meager marlin Jan 18, 2025, 11:12 PM

#

invoking ub as an assert is not a good idea as the compiler can remove it in release mode

#

you could invoke the equivalent of __debugbreak and then exit or abort

fallen ocean Jan 18, 2025, 11:18 PM

#

meager marlin you could invoke the equivalent of __debugbreak and then exit or abort

I think this assert function should be fine?

#

It would have prevented the issue where the function was not being called, as the function param is a bool, and the compiler won't remove UB in release mode (at least I think so?)
Plus, for now I'll just use __debugbreak as it's available in MSVC

limber smelt Jan 18, 2025, 11:20 PM

#

this looks cursed

#

just include <cassert>

#

those snippets above look like some gpt output or some copy pasta from some cpilled kid which does c++ but doesnt want to do c++ on stack overflow

#

__debugbreak works but is also not portable, but i guess you still just target msvc/d3d12?

meager marlin Jan 18, 2025, 11:23 PM

#

it's easy enough to port to other environments

#

I think this function needs to abort on fail too

fallen ocean Jan 18, 2025, 11:23 PM

#

well the intention was to not rely on #defs that may or may not be present, but that shouldn't be an issue realistically

I'll switch to using normal assert itself

fallen ocean Jan 18, 2025, 11:23 PM

#

limber smelt __debugbreak works but is also not portable, but i guess you still just target m...

Yeah

limber smelt Jan 18, 2025, 11:23 PM

#

yeah, the other stuff is more weird : )

#

i use someone's portable debug_break because im not targeting c++26 yet

meager marlin Jan 18, 2025, 11:24 PM

#

meager marlin I think this function needs to abort on fail too

Or perhaps throw an exception

limber smelt Jan 18, 2025, 11:25 PM

#

assert will throw an exception as the default no?

#

that ugly message box

meager marlin Jan 18, 2025, 11:27 PM

#

normal assert ostensibly isn't viable

fallen ocean Jan 18, 2025, 11:29 PM

#

limber smelt that ugly message box

It prints out

Assertion failed: 0 && "This?", file C:\Dev\voxel-engine\src\main.cpp, line 20
and this message box

limber smelt Jan 18, 2025, 11:38 PM

#

yeah that one

fallen ocean Jan 19, 2025, 5:25 AM

#

I've never done actual error handling / exception stuff before. I will start working on this once I get that fancy single heap memory idea done, because after that I plan on restructuring the code properly

#

I think I found the reason execute indirect was causing nsight and renderdoc to crash. I have 2 different definitions of indirect command, one in HLSL and one in C++, and they are causing a LOT of problems for me because of implicit alignment and padding.

on C++ side:


```cpp
struct indirect_command_t
{
    interop::voxel_render_resources_t render_resources{};
    D3D12_INDEX_BUFFER_VIEW index_buffer_view{}; // alignment of 8
    D3D12_DRAW_INDEXED_ARGUMENTS draw_arguments{};
};
static_assert(sizeof(interop::gpu_indirect_command_t) == sizeof(indirect_command_t));
```

On HLSL side:
```cpp
struct gpu_indirect_command_t
{
    voxel_render_resources_t voxel_render_resources;
    uint padding;
    uint4 index_buffer_view;
    uint4 draw_arguments_1;
    uint draw_arguments_2;
    uint padding2;
};
```
I fixed it by doing a `#pragma pack(push, 4)` on the C++ side. Plus, I have a static assert so that the size of the CPU and GPU indirect command is the same, hopefully mitigating such issues in future
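The padding effect can be shown without the D3D12 headers. The sketch below uses stand-in structs with the same member sizes as a 12-byte resources struct, a 16-byte view containing a 64-bit address, and five 32-bit draw arguments. All names are hypothetical, and the offsets in the comments assume a typical 64-bit target where `uint64_t` is 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct render_resources_t // 12 bytes, alignment 4
{
    std::uint32_t a, b, c;
};

struct buffer_view_t // 16 bytes, alignment 8 because of the 64-bit address
{
    std::uint64_t address;
    std::uint32_t size_in_bytes;
    std::uint32_t format;
};

struct draw_arguments_t // 20 bytes, alignment 4
{
    std::uint32_t values[5];
};

// Default layout: the compiler inserts 4 hidden bytes before `view`
// (offset 12 -> 16) and 4 more at the end to round the size up to 56.
struct natural_command_t
{
    render_resources_t resources;
    buffer_view_t view;
    draw_arguments_t draw;
};

// Packed to 4 bytes: no hidden padding, members sit back to back.
#pragma pack(push, 4)
struct packed_command_t
{
    render_resources_t resources;
    buffer_view_t view;
    draw_arguments_t draw;
};
#pragma pack(pop)

constexpr std::size_t natural_view_offset = offsetof(natural_command_t, view);
constexpr std::size_t packed_view_offset = offsetof(packed_command_t, view);
```

Checking `offsetof` for each member, in addition to `sizeof`, catches the case where two layouts have the same total size but different member positions.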

fallen ocean Jan 19, 2025, 6:38 AM

#

Some metrics for reference (current application frame time when thousands of chunks are being loaded at once) :

fallen ocean Jan 19, 2025, 9:09 AM

#

Uhh so according to nsight systems only one CPU core is active? This doesn't look right to me at all

#

I can see 12 different thread IDs, so now I'm really confused as to why nsight says something else frog_thinkk

A screenshot from my thread pool code and console output:

fallen ocean Jan 19, 2025, 9:33 AM

#

sorry for the bajillion screenshots, but here's the culprit : each frame is filled with CreateCommittedResource calls, so the app is heavily CPU bottlenecked

#

fallen ocean Jan 20, 2025, 3:07 AM

#

Making slow progress
Each chunk just has 'offsets' into the color / index buffer data, pointing to the actual (very large) committed resources that the chunk manager holds

This means ram / vram usage is constant. Still have a lot of issues (I'm doing memcpy / copy resource for the entire buffer each frame) + there are some index related bugs, which I am working on right now

fallen ocean Jan 21, 2025, 2:45 PM

#

I tried having the large buffers be upload buffers (to avoid multiple copy resources per frame), but something looks fishy

#

I am not sure if this means it takes 30ms between command list close and reset on the CPU side? That doesn't look right to me

fallen ocean Jan 21, 2025, 3:16 PM

#

Time to get tracy implemented, something is really wrong on the CPU side (and I've wanted to try out tracy for a long time, so it should be a good thing to setup)

fallen ocean Jan 21, 2025, 4:06 PM

#

Oh hell nah. 32ms to fill a buffer on the cpu side 💀

#

Quickest optimization fix ever KEKW
Now when new chunks are loaded, I directly update the upload buffers with the chunk indices data
Now rendering is at a smooth 60FPS even when bajillion chunks are loading

Still some offset related issue, need to figure out why that is happening

#

Still something wrong with the cpu side : memory usage shoots up wayy too rapidly.
Need to have a constant allocation like what I do on the GPU side

here is a screencap with latest optimization updates :)

fallen ocean Jan 22, 2025, 3:04 PM

#


```cpp
{
    ZoneScopedN("Chunk eviction");
    std::vector<voxel_chunk_position_t> keys_of_chunks_to_unload{};
    std::erase_if(chunk_manager.m_loaded_chunks,
                  [current_chunk_3d_index, &chunk_manager, &keys_of_chunks_to_unload](
                      const std::pair<const voxel_chunk_position_t, voxel_chunk_t> &key_value_pair) {
                      auto &chunk = key_value_pair.second;

                      if ((std::abs(chunk.m_chunk_position.x - current_chunk_3d_index.x) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2) ||
                          (std::abs(chunk.m_chunk_position.y - current_chunk_3d_index.y) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2) ||
                          (std::abs(chunk.m_chunk_position.z - current_chunk_3d_index.z) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2))
                      {
                          keys_of_chunks_to_unload.push_back(key_value_pair.first);
                          return true;
                      }

                      return false;
                  });
    if (!keys_of_chunks_to_unload.empty())
    {
        std::scoped_lock<std::mutex> scoped_lock(chunk_manager.m_unloaded_chunk_queue_mutex);
        for (const auto &key : keys_of_chunks_to_unload)
        {
            auto it = chunk_manager.m_loaded_chunks.find(key);
            if (it != chunk_manager.m_loaded_chunks.end()) // check added just for my sanity
            {
                chunk_manager.m_unloaded_voxel_chunks.emplace(std::move(it->second));
            }
        }
    }
}
```

#

This small piece of code is causing voxel_chunk_t constructor to be called, which internally allocates memory for a new chunk (heap allocated array of voxels).
Not sure how and where the constructor is being called, as it feels really strange
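One thing that can be read off the snippet itself: by the time the second loop runs, `std::erase_if` has already destroyed the matching entries, so `find(key)` returns `end()` for every collected key and nothing reaches the unloaded queue. That does not by itself explain the constructor call, so treat the following only as a sketch of an alternative: move each far-away chunk out of the map first and erase the entry afterwards. The types here are stand-ins (an `int` key, a one-dimensional position), and copying is deleted so that an accidental copy becomes a compile error instead of a hidden allocation.

```cpp
#include <cassert>
#include <cstdlib>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

struct chunk_t
{
    int x = 0;
    std::vector<unsigned char> voxels; // stands in for the heap-allocated voxel array

    chunk_t() = default;
    explicit chunk_t(int px) : x(px), voxels(512) {}

    chunk_t(chunk_t &&) noexcept = default;
    chunk_t &operator=(chunk_t &&) noexcept = default;
    chunk_t(const chunk_t &) = delete; // copying would allocate, so forbid it
    chunk_t &operator=(const chunk_t &) = delete;
};

// Moves every chunk that is at least max_distance away from current_x
// into the unloaded queue, then erases its (now moved-from) map entry.
inline void evict_far_chunks(std::unordered_map<int, chunk_t> &loaded,
                             std::queue<chunk_t> &unloaded,
                             int current_x, int max_distance)
{
    for (auto it = loaded.begin(); it != loaded.end();)
    {
        if (std::abs(it->second.x - current_x) >= max_distance)
        {
            unloaded.push(std::move(it->second)); // steals the voxel storage
            it = loaded.erase(it);                // erase returns the next valid iterator
        }
        else
        {
            ++it;
        }
    }
}
```

In the real code the queue push would still need to happen under the unloaded-queue mutex.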

tough pier Jan 22, 2025, 3:16 PM

#

fallen ocean Still something wrong with the cpu side : memory usage shoots up wayy too rapidl...

ok ik its just debug stuff but visually this looks really cool despite the compression noises. reminds me of early windows screensavers

fallen ocean Jan 23, 2025, 12:07 AM

#

This doesn't make sense what
Memory usage when I load chunks, don't move camera, and trivy is connected

#

Memory usage when trivy isn't connected for the same scenario (it never stops growing?)

meager marlin Jan 23, 2025, 12:09 AM

#

what's trivy?

#

tracy?

#

tracy keeps a growing buffer of information to send to the client that it expects will be eventually connected

#

you can disable this via some define, but it will make any data prior to the client being connected not appear

fallen ocean Jan 23, 2025, 12:10 AM

#

my bad meant to say tracy, trivy is a different tool

fallen ocean Jan 23, 2025, 12:11 AM

#

meager marlin tracy keeps a growing buffer of information to send to the client that it expect...

No, its the opposite
process memory usage grows without tracy being connected

meager marlin Jan 23, 2025, 12:11 AM

#

that's also what I'm describing

#

the client consumes the info but in this case there's no client connected so it just keeps accumulating in your app

fallen ocean Jan 23, 2025, 12:12 AM

#

Ahh I see what you mean now

meager marlin Jan 23, 2025, 12:12 AM

#

fallen ocean Jan 23, 2025, 12:14 AM

#

I added the TRACY_ON_DEMAND macro, things work as expected now
Thought for a second that my code was really broken. It is, but not too much
Thanks for the help 🙂

meager marlin Jan 23, 2025, 12:15 AM

#

I experienced this issue myself hehe

#

eventually tracked it down to having ZoneScoped; in a really hot function and then someone pointed this out to me

fallen ocean Jan 23, 2025, 12:16 AM

#

I see.. had a question for you (since you also work on voxel stuff)
How much RAM does your process use? In my engine, when I load around 2k chunks, VS reports 1.4-1.5 GB of memory usage. Is this too high for a voxel engine?

#

I don't compress my data at all (each chunk -> 8x8x8 voxels, each voxel is 1 bit, each chunk (on cpu) stores an array of indices as u16 and a float3 color), but if this is way too high I might have to start changing things
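
A rough back-of-the-envelope check of those numbers, as a sketch only. It assumes an upper bound where every face of every voxel is emitted with 6 indices per face and nothing is shared, which is more than face culling could ever produce; the constant names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "the estimate assumes 32-bit floats");

constexpr std::size_t voxels_per_chunk = 8 * 8 * 8;            // 512 voxels
constexpr std::size_t occupancy_bytes = voxels_per_chunk / 8;  // 1 bit per voxel -> 64 bytes
constexpr std::size_t color_bytes = 3 * sizeof(float);         // one float3 per chunk -> 12 bytes

// Assumed upper bound: 6 faces per voxel, 6 u16 indices per face (two triangles).
constexpr std::size_t worst_case_index_bytes =
    voxels_per_chunk * 6 * 6 * sizeof(std::uint16_t);          // 36864 bytes

constexpr std::size_t worst_case_chunk_bytes =
    occupancy_bytes + color_bytes + worst_case_index_bytes;    // 36940 bytes

constexpr std::size_t chunk_count = 2000;
constexpr std::size_t worst_case_total_bytes =
    worst_case_chunk_bytes * chunk_count;                      // 73,880,000 bytes, roughly 70 MiB
```

Under these assumptions the per-chunk data for 2k chunks stays around 70 MiB, so a process in the 800 MB to 1.4 GB range is probably spending its memory somewhere other than the raw chunk data.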

meager marlin Jan 23, 2025, 12:21 AM

#

what does task manager say when you run the exe outside the debugger?

#

vs also has a button to launch without the debugger

#

vs says 3gb but task manager says 1.3 gb

fallen ocean Jan 23, 2025, 12:25 AM

#

Strange, VS says 1.4 but task manager reports 800mb

meager marlin Jan 23, 2025, 12:25 AM

#

the extra stuff is probably debug structures that the debugger allocates

fallen ocean Jan 23, 2025, 12:26 AM

#

Perhaps, I always thought debugging had very little overhead, but I suppose it's a lot more complicated than that

meager marlin Jan 23, 2025, 12:29 AM

#

idk much about it either, I'm just making an inference

fallen ocean Jan 27, 2025, 2:52 AM

#

ahhh stuck with some office shenanigans, so will probably get back to work on this engine sometime this week
Next plan : make all cpu side memory allocations static, followed by code that is well structured (minor refactor), error handling,etc

fallen ocean Feb 3, 2025, 3:41 PM

#

Was able to make all CPU side allocations upfront.
However, now it's back to the strange performance issues : FPS drops to 30fps when loading chunks, even though there are no memory allocations going on frog_dum

#

fallen ocean Feb 3, 2025, 4:03 PM

#

Ah I see the issue now
Basically, I have a hashmap that maps a chunk position to an index into a vector of preallocated chunks

In the renderloop, I iterate over the hashmap and render. However, in the async chunk meshing function, I have to access this hashmap. So, I added a mutex which causes the main loop to sometimes wait on the chunk meshing function

#

Need to figure out a solution to this -> either have a system where main loop has a high 'priority' and can unlock mutex at any point of time, or some other way where the main loop never has to wait on anything

#

Before I had a function to iterate over a queue of std::async's and load said chunks into the loaded chunk hashmap. Might have to do something similar this time too (even though the std::async doesn't actually return a value)
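
A minimal sketch of that idea, assuming the jobs are launched with `std::launch::async` and that each one reports the index of the preallocated chunk slot it filled (`chunk_slot_index` and `drain_finished_jobs` are made-up names). The main loop polls with a zero timeout, so it never waits on a worker.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <future>
#include <vector>

// Hypothetical job result: index of the preallocated chunk slot that was meshed.
using chunk_slot_index = std::size_t;

// Moves the results of all finished jobs into `ready` without blocking the caller.
// Unfinished jobs stay in `pending` and are checked again next frame.
inline std::size_t drain_finished_jobs(std::deque<std::future<chunk_slot_index>>& pending,
                                       std::vector<chunk_slot_index>& ready)
{
    std::size_t drained = 0;
    for (auto it = pending.begin(); it != pending.end();)
    {
        assert(it->valid());
        // A zero timeout only asks "is the result there yet?".
        if (it->wait_for(std::chrono::seconds(0)) == std::future_status::ready)
        {
            ready.push_back(it->get());
            it = pending.erase(it);
            ++drained;
        }
        else
        {
            ++it;
        }
    }
    return drained;
}
```

Only the main thread touches the hashmap in this arrangement: workers hand back results through their futures, and the render loop inserts them after draining.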

fallen ocean Feb 8, 2025, 12:11 PM

#

😦 I removed all mutexes and now use a queue of std::future, but something is still seriously wrong
maybe my idea of having 1000's of chunks loaded at the same time is wrong, or something else

When new chunks are being loaded, the app temporarily drops a couple FPS, which should not happen

Now I am at a point where the main thread is itself becoming a bottleneck : The code that checks if a chunk has to be evicted or not takes 10s of ms in the worst-case scenario

meager marlin Feb 8, 2025, 12:12 PM

#

Btw you can create tracked mutexes in Tracy so you can see which threads are waiting on them

#

Sadly it makes the client crash on my machine for some reason (possibly I have too many threads-it only supports up to 64)

fallen ocean Feb 8, 2025, 12:14 PM

#

meager marlin Btw you can create tracked mutexes in Tracy so you can see which threads are wai...

Well right now I have no mutexes anywhere except in the thread pool, but I will keep this point in mind
I guess I am not handling the hashmaps correctly or something (all voxel chunk related data is static, but maybe the chunk unloading / loading setup is messed up)

meager marlin Feb 8, 2025, 12:15 PM

#

Wait I have dementia you literally just said you removed them lol

fallen ocean Feb 8, 2025, 12:22 PM

#

Apparently I am trying to unload 5 million chunks froghorror
Theres like 6K chunks loaded what on earth am I trying to unload

#

no wonder memory usage was soo high despite having static mem allocs

#

Curious, is this a common programming paradigm or just bad code?
I am trying to not add items to my stacks / queue if they are already present, so I use an unordered set to get O(1) lookup times. Haven't seen this being used elsewhere so not sure how good this is
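
The queue + set pairing can be sketched like this. The class and key type are hypothetical (a chunk position packed into 64 bits), not the engine's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <unordered_set>

// Hypothetical key type: a chunk position packed into 64 bits.
using chunk_key = std::uint64_t;

// FIFO queue that refuses duplicates. The set mirrors the queue's contents
// so membership checks are O(1) on average.
class unique_chunk_queue
{
public:
    // Returns false (and changes nothing) if the key is already queued.
    bool push(chunk_key key)
    {
        if (!m_members.insert(key).second) return false;
        m_order.push(key);
        return true;
    }

    // Removes and returns the oldest key; the key may be queued again afterwards.
    chunk_key pop()
    {
        assert(!m_order.empty());
        const chunk_key key = m_order.front();
        m_order.pop();
        m_members.erase(key);
        return key;
    }

    bool empty() const { return m_order.empty(); }
    std::size_t size() const { return m_order.size(); }

private:
    std::queue<chunk_key> m_order;
    std::unordered_set<chunk_key> m_members;
};
```

The cost is storing every key twice, which is usually a fair trade for keeping the queue bounded by the number of distinct chunks.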

meager marlin Feb 8, 2025, 12:31 PM

#

hmm

#

I can't think of a better way off the top of my head, so it's probably fine

meager marlin Feb 8, 2025, 12:34 PM

#

fallen ocean 😦 I removed all mutexes and now use a queue of std::future, but something is st...

What's this eviction and setup stuff that's taking forever?

fallen ocean Feb 8, 2025, 12:42 PM

#

meager marlin What's this eviction and setup stuff that's taking forever?

Thats a bug I fixed just now
Basically before I didn't have an unloaded chunk set, so the unloaded chunk queue variable grew to millions of elements (it never stopped growing as it was adding a lot of duplicate elements)

Now that I have added this queue + set logic, chunk eviction went right down to 0.5ms

#

meager marlin Feb 8, 2025, 12:43 PM

#

500us is still a long time

#

Is that just for the first frame

fallen ocean Feb 8, 2025, 12:45 PM

#

meager marlin Is that just for the first frame

no, each frame 😬

meager marlin Feb 8, 2025, 12:46 PM

#

ah, it needs to be faster then

fallen ocean Feb 8, 2025, 12:47 PM

#

That function is called 5k times, is that even a large number for the CPU?

meager marlin Feb 8, 2025, 12:47 PM

#

is this release mode

fallen ocean Feb 8, 2025, 12:48 PM

#

Yeah, release with debug info
Let me check if just release mode changes anything

meager marlin Feb 8, 2025, 12:48 PM

#

fallen ocean That function is called 5k times, is that even a large number for the CPU?

It depends

#

But I wonder if you could use a different data structure that doesn't force you to iterate over every chunk

#

Like you could use a 2D grid with modular indexing
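
A rough sketch of what modular indexing could look like, assuming a fixed 16x16 window of chunk slots (the size and all names here are made up). A chunk coordinate maps to a slot by wrapping, so a chunk entering the window on one side lands in the slot of the chunk that left on the other side, and nothing has to iterate over every loaded chunk.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical 2D chunk coordinate (in chunk units, not world units).
struct grid_coord
{
    int x;
    int z;
    bool operator==(const grid_coord& other) const { return x == other.x && z == other.z; }
};

constexpr int grid_size = 16; // assumed window size, in chunks per axis

// Wraps any integer (including negatives) into [0, modulus).
constexpr int wrap(int value, int modulus) { return ((value % modulus) + modulus) % modulus; }

constexpr std::size_t slot_for(grid_coord c)
{
    return static_cast<std::size_t>(wrap(c.z, grid_size) * grid_size + wrap(c.x, grid_size));
}

struct chunk_grid
{
    std::array<std::optional<grid_coord>, grid_size * grid_size> slots{};

    // Places a chunk and returns the different chunk that owned the slot before, if any.
    // The returned chunk is the one to evict.
    std::optional<grid_coord> place(grid_coord c)
    {
        assert(slot_for(c) < slots.size());
        std::optional<grid_coord>& slot = slots[slot_for(c)];
        std::optional<grid_coord> evicted;
        if (slot && !(*slot == c)) evicted = slot;
        slot = c;
        return evicted;
    }

    bool contains(grid_coord c) const
    {
        const std::optional<grid_coord>& slot = slots[slot_for(c)];
        return slot && *slot == c;
    }
};
```

Eviction then falls out of placement: loading the chunk at x = 16 reuses the slot of the chunk at x = 0, which is exactly the one that just left a 16-wide window.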

fallen ocean Feb 8, 2025, 12:51 PM

#

fallen ocean Yeah, release with debug info Let me check if just release mode changes anything

In release mode I observed 750us but for 24K iterations

meager marlin Feb 8, 2025, 12:51 PM

#

Actually this is basically the same as a problem we had when implementing virtual shadow maps

#

Which is how to deallocate edge pages when the light frustum moves

#

Actually I didn't deallocate anything in the end so nvm

fallen ocean Feb 8, 2025, 12:54 PM

#

meager marlin Like you could use a 2D grid with modular indexing

Yeah definitely, I made this cuda flocking boid sim a while back where I used a spatial grid, got a massive perf improvement using it

meager marlin Feb 8, 2025, 12:55 PM

#

the "problem" is that you'd have to move every chunk around the grid whenever the player moved, which might be a little costly, though probably not too bad

fallen ocean Feb 8, 2025, 12:56 PM

#

meager marlin the "problem" is that you'd have to move every chunk around the grid whenever th...

I don't move anything

meager marlin Feb 8, 2025, 12:56 PM

#

I suppose I should also mention that you should only do any of this when the player moves into another chunk
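
Something along these lines, as a sketch: work out which chunk the camera is in and run the load / evict pass only when that changes. The chunk size of 8 world units and all names are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 3D chunk coordinate.
struct camera_chunk_coord
{
    int x;
    int y;
    int z;
};

inline bool operator==(const camera_chunk_coord& a, const camera_chunk_coord& b)
{
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

constexpr float chunk_world_size = 8.0f; // assumed edge length of a chunk in world units

// floor (not truncation) keeps negative positions in the right chunk: -0.5 -> chunk -1.
inline camera_chunk_coord chunk_containing(float px, float py, float pz)
{
    assert(std::isfinite(px) && std::isfinite(py) && std::isfinite(pz));
    return {static_cast<int>(std::floor(px / chunk_world_size)),
            static_cast<int>(std::floor(py / chunk_world_size)),
            static_cast<int>(std::floor(pz / chunk_world_size))};
}

class chunk_crossing_tracker
{
public:
    // Returns true on the first call and whenever the camera enters a different chunk.
    bool update(float px, float py, float pz)
    {
        const camera_chunk_coord current = chunk_containing(px, py, pz);
        if (m_has_previous && current == m_previous) return false;
        m_previous = current;
        m_has_previous = true;
        return true;
    }

private:
    camera_chunk_coord m_previous{0, 0, 0};
    bool m_has_previous = false;
};
```

The per-frame cost becomes three divisions and a comparison, and the expensive pass runs only on the frames where `update` returns true.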

meager marlin Feb 8, 2025, 12:56 PM

#

fallen ocean I don't move anything

I mean with my idea, sry

#

Which is to literally have a grid representing chunks around the player and move the pointers around when the player moves

fallen ocean Feb 8, 2025, 12:58 PM

#

Current engine status:

#

Horrible compression artifacts but at least now I can fly around rapidly (each colored cube is an entire chunk) and not run out of memory

meager marlin Feb 8, 2025, 12:59 PM

#

If the pointer falls off the grid, then delete the chunk (you can define a smaller area within the grid to allocate chunks so moving back and forth a small distance can't cause a chunk to repeatedly appear and disappear)

fallen ocean Feb 8, 2025, 1:00 PM

#

meager marlin If the pointer falls off the grid, then delete the chunk (you can define a small...

This was what I was going for at first, but seemed fairly complex at that time
But at least now I know how to do memory management (by not doing any allocations 🙂 )

#

As much as I want to improve the chunk eviction logic using a grid or something, I hate how the voxels look, so next up might be getting basic lighting, some procedural terrain gen, and then I'll come back to perf improvements

meager marlin Feb 8, 2025, 1:02 PM

#

Yeah, you got your perf win already

#

Worry about the .5ms later lol

fallen ocean Feb 8, 2025, 1:04 PM

#

Hehe true
I'm honestly surprised how many perf wins I can get by just not being stupid

Before I used to wonder what my bottleneck was and try random optimizations, but now that I got tracy set up I know exactly what's wrong and what to fix first

fallen ocean Mar 15, 2025, 11:21 AM

#

https://tenor.com/view/revive-un-dies-revives-back-to-life-resurrect-gif-21804877

#

So I decided to finally get good, so working on this project again
On a completely unrelated sidenote, I really don't like ImGui's new dx12 init api.
You basically have to pass an alloc function that can be used to alloc new descriptors. However, the type definition is

 void (*SrvDescriptorAllocFn)(ImGui_ImplDX12_InitInfo* info, D3D12_CPU_DESCRIPTOR_HANDLE* out_cpu_desc_handle, D3D12_GPU_DESCRIPTOR_HANDLE* out_gpu_desc_handle);

#

No place for a 'this' ptr, so can't really use a member function for this.
Create a lambda func and pass it in? Nope, I'd need a capture list and I don't think I can use that with the above function pointer, because a lambda can't decay to a function pointer if it captures anything

meager marlin Mar 15, 2025, 11:36 AM

#

you should raise an issue on gh if you think a user pointer would be helpful

#

alternatively, copy the backend into your source tree and fix it locally

fallen ocean Mar 15, 2025, 11:37 AM

#

meager marlin you should raise an issue on gh if you think a user pointer would be helpful

Was trying to figure out if I was maybe missing something, but I think a user pointer could be useful here

meager marlin Mar 15, 2025, 11:38 AM

#

I did that with the vulkan backend and ended up changing quite a few things

fallen ocean Mar 15, 2025, 11:38 AM

#

meager marlin alternatively, copy the backend into your source tree and fix it locally

I used a global variable to fix this issue (as there is a comment that indicates for now only a single descriptor would be used, but it might change in the future)

#

the old api is still usable, but I switched to VCPKG and the package there hasn't been updated, because of which when I use the latest vcpkg package imgui fails to initialize 😅

fallen ocean Mar 15, 2025, 12:49 PM

#

Lmao ImGui_ImplDX12_InitInfo already has a UserData void pointer, only noticed this while writing the issue 😭
Of course ocornut thought about something like this, massive skill issue for me
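
For reference, the usual way a user-data pointer replaces the missing 'this' looks roughly like the sketch below. The structs are simplified stand-ins so it compiles without ImGui or D3D12: `init_info`, `cpu_handle`, `gpu_handle` and `descriptor_heap` are made-up names that only mirror the shape of the callback, not the real types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the D3D12 descriptor handle types (hypothetical, for illustration only).
struct cpu_handle { std::uint64_t ptr; };
struct gpu_handle { std::uint64_t ptr; };

// Stand-in for an init struct that carries a user pointer next to a plain function pointer.
struct init_info
{
    void* user_data = nullptr;
    void (*srv_alloc_fn)(init_info* info, cpu_handle* out_cpu, gpu_handle* out_gpu) = nullptr;
};

// Hypothetical bump allocator over a descriptor heap.
class descriptor_heap
{
public:
    descriptor_heap(std::uint64_t cpu_base, std::uint64_t gpu_base, std::uint32_t stride)
        : m_cpu_base(cpu_base), m_gpu_base(gpu_base), m_stride(stride) {}

    void allocate(cpu_handle* out_cpu, gpu_handle* out_gpu)
    {
        const std::uint64_t offset = static_cast<std::uint64_t>(m_next) * m_stride;
        out_cpu->ptr = m_cpu_base + offset;
        out_gpu->ptr = m_gpu_base + offset;
        ++m_next;
    }

    // Static member functions have no 'this', so they convert to plain function pointers.
    // The object is recovered from the user pointer instead of a lambda capture.
    static void alloc_trampoline(init_info* info, cpu_handle* out_cpu, gpu_handle* out_gpu)
    {
        assert(info != nullptr && info->user_data != nullptr);
        static_cast<descriptor_heap*>(info->user_data)->allocate(out_cpu, out_gpu);
    }

private:
    std::uint64_t m_cpu_base;
    std::uint64_t m_gpu_base;
    std::uint32_t m_stride;
    std::uint32_t m_next = 0;
};
```

A captureless lambda that does the same `static_cast` would also work, since only lambdas with captures fail to convert to a function pointer.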

fallen ocean Mar 15, 2025, 3:46 PM

#

quickly put together some procedural generation stuff (using fast noise lite)

#

but right now its running at 30fps even when no chunks are loading. Without proc gen it was 144FPS 😦 will do some profiling tomorrow to see what went wrong

fallen ocean Mar 20, 2025, 12:40 PM

#

The issue was I was doing meshing and generation for each voxel one by one, so chunking was not really working

Also have to put this project on hold cuz I might go to jail || currently awaiting approval from employer to work on this project as its technically open source. Worst case I make the repo private and continue working on it ||

fallen ocean Apr 12, 2025, 7:40 AM

#

fallen ocean The issue was I was doing meshing and generation for each voxel one by one, so c...

|| this got denied, so working on the project in a private repo ||
Finally some more progress, I am now using ReBAR, so switched to using GPU_UPLOAD heaps in many places. Need to check if this has any major perf impact

fallen ocean Apr 17, 2025, 4:34 PM

#

This is a chunk apparently
experimenting with using 1 dispatchmesh per chunk, not really possible due to the limitation that the number of vertices and the number of primitives must each be <= 256
but still what is going on with this chunk

fallen ocean Apr 18, 2025, 12:32 PM

#

getting somewhere
I'm basically using an amplification shader to do the meshing work (for a 4x4x2 chunk), and each individual triangle is processed by a simple mesh shader || (with thread block size as 1x1x1 💀 ) ||
Will fix it once this AS + MS simple case is done

fallen ocean Apr 18, 2025, 3:05 PM

#

Tis done
1 amplification shader per chunk, and mesh shaders process triangles & vertices (32 'meshlets processed' in a thread group this time, where each meshlet = mesh for a single cube face or 2 triangles)
The amplification shader does all the meshing (nothing fancy, just don't render a face if 2 voxels are next to each other kind of thing), so CPU rn just needs to fill a uint32 where each bit tells if a voxel in the 4x4x2 chunk is active or not
no more R16 index buffers on CPU, so memory usage is heavily reduced

Plus, the mesh shader does backface culling so perf is a little bit better
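
The occupancy mask and the face rule can be sketched on the CPU side like this. This is plain C++ for illustration, not the shader; the bit layout (x fastest, then y, then z) and treating neighbours outside the chunk as empty are assumptions.

```cpp
#include <cstdint>

constexpr int chunk_dim_x = 4;
constexpr int chunk_dim_y = 4;
constexpr int chunk_dim_z = 2; // 4 * 4 * 2 = 32 voxels, one bit each in a uint32

// Assumed layout: x varies fastest, then y, then z.
constexpr int bit_index(int x, int y, int z) { return x + chunk_dim_x * (y + chunk_dim_y * z); }

constexpr std::uint32_t with_voxel(std::uint32_t mask, int x, int y, int z)
{
    return mask | (std::uint32_t{1} << bit_index(x, y, z));
}

// Voxels outside the chunk count as empty here (assumption: no cross-chunk lookup).
constexpr bool is_active(std::uint32_t mask, int x, int y, int z)
{
    if (x < 0 || y < 0 || z < 0 || x >= chunk_dim_x || y >= chunk_dim_y || z >= chunk_dim_z)
        return false;
    return ((mask >> bit_index(x, y, z)) & 1u) != 0;
}

// A face is emitted only when the voxel is active and the neighbour across that face is not.
constexpr int visible_face_count(std::uint32_t mask)
{
    constexpr int offsets[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    int faces = 0;
    for (int z = 0; z < chunk_dim_z; ++z)
        for (int y = 0; y < chunk_dim_y; ++y)
            for (int x = 0; x < chunk_dim_x; ++x)
            {
                if (!is_active(mask, x, y, z)) continue;
                for (int f = 0; f < 6; ++f)
                    if (!is_active(mask, x + offsets[f][0], y + offsets[f][1], z + offsets[f][2]))
                        ++faces;
            }
    return faces;
}
```

A lone voxel yields 6 faces, two touching voxels yield 10, and a completely full chunk yields only its 64 outer faces.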

Next step is to learn DXR as I'm sick of these ugly ahh voxels

#

fallen ocean May 11, 2025, 1:16 PM

#

fallen ocean Tis done 1 amplification shader per chunk, and mesh shaders process triangles & ...

DXR is definitely harder than I thought 🙂 || and I'm not able to get time to work on this project, not an excuse ||
Right now I'm using ray pipelines (and not ray query), and got the bare minimum down: shadows, reflections, and transparency

I'll probably do a bit more RT in the meanwhile, and eventually (if I don't get skill issues) try custom intersection shaders that I can use for voxel rendering
