#voxel-engine, a creatively named Voxel engine

225 messages · Page 1 of 1 (latest)

fallen ocean
#

A small C++/D3D12 voxel engine made for learning about GPUs, optimizations, and voxels

#

Feels very strange posting about this tiny learning project in front of the massive projects here, but I decided to get back into graphics programming by making a voxel engine using the tools I am oki-sh with, C++ and D3D12.

Some principles I want to follow during this project:
(i) Almost daily updates (accountability for myself)
(ii) No over-engineering, no unnecessary abstractions
(iii) Fast, optimized and clean code
(iv) Making something other than the generic Sponza renderer
(v) Learn what 99.9% of my code actually does (i.e. when I link a library what happens, what happens when I do X and Y, etc.)

fallen ocean
#

[Day 2]

Today I focused mostly on a simple renderer abstraction. Nothing crazy, the only abstraction it really provides right now is buffer creation.

Also made some small changes to the main rendering code, and now have multiple chunks being rendered.
The app load time is really slow (chunks being created I suppose), so that is something I have to focus on soon

fallen ocean
#

[Day 3]
So today I tried making a simple chunk loading system, where if the camera enters a new 'chunk area / zone', a new chunk is loaded. I'm re-using the same chunk data over and over, just want to get the math done right.

Unfortunately I don't have any fancy screenshots as I couldn't get it working 100% right, but I'm pretty sure by tomorrow I will have this basic chunk loading system done

fallen ocean
#

[Day 4]
Finally found a reason to use line list, now I have a debug visualization of the various chunks in the scene. Hopefully this will make it easier to visualize the chunks when working on the loading system

fallen ocean
#

I implemented this very basic naive chunk loading algorithm, not very efficient but it gives me a baseline to implement more things

I've switched stuff around in my project so that a Chunk is just a u64 chunk_index. The actual cubes (that each voxel represents) are moved over to a hashmap of {chunk_index, Cube*} so that when I move chunks between the different vectors (loaded, unloaded, and chunks_to_load), I don't have to worry much about move / copy constructors, destructors being called, and such.

Next step is going to be learning how to procedurally generate chunks (rather than just give them all a random color).
Once that is done, I will come back to the existing code and try optimizing it a bit.

#

That gray background is my debug grid draw, should probably disable it when posting future videos

fallen ocean
#

[Day 5 and 6]
There are some small miscalculations in how I figure out which chunk the user is currently in (I was not facing this before because voxel size was fixed to 1). My end goal is to have tiny voxels, but figuring out this math issue should give me a better understanding of the voxel / chunk coordinate system

fallen ocean
#

[Day 7]
First, I've fixed yesterday's calculations. I was dividing the camera position by the wrong constant, so the "current chunk index" computation was wrong.
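
A minimal sketch of that chunk index computation, assuming the right divisor is the world-space edge length of a chunk (voxel size times voxels per chunk edge). The constants and the `chunk_index_from_position` name are hypothetical; the thread does not show the actual code.

```cpp
#include <cassert>
#include <cmath>

struct int3_t
{
    int x, y, z;
};

// Hypothetical values, chosen only for illustration.
constexpr float VOXEL_EDGE_LENGTH = 0.25f;
constexpr int VOXELS_PER_CHUNK_EDGE = 16;
constexpr float CHUNK_EDGE_LENGTH = VOXEL_EDGE_LENGTH * VOXELS_PER_CHUNK_EDGE;

// Divide by the chunk's world-space size, not by the voxel count or voxel size alone.
// std::floor (rather than truncation) keeps negative positions in the correct chunk.
inline int3_t chunk_index_from_position(float x, float y, float z)
{
    return {
        static_cast<int>(std::floor(x / CHUNK_EDGE_LENGTH)),
        static_cast<int>(std::floor(y / CHUNK_EDGE_LENGTH)),
        static_cast<int>(std::floor(z / CHUNK_EDGE_LENGTH)),
    };
}
```

With these constants a chunk is 4 units wide, so a camera at x = 4.5 is in chunk 1 and a camera at z = -0.1 is in chunk -1 rather than chunk 0.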

I've changed the code a bit so that even if X chunks are queued for creation, only Y are created each frame (need to figure out how chunks next to the player can be loaded first).

However, now I'm getting a lot of issues with ComPtr (InternalRelease). I think this is because of some copy-constructor-related activity. I will refresh on the rule of 5 today and try using moves as much as possible (and try fixing this InternalRelease issue that randomly occurs during the assignment operator).

fallen ocean
#

[Day 8]
A short demo of the new and improved chunk loading system (still very naive). The demo renders only 1 voxel per chunk (render distance / how far chunks are loaded from player = 9)

#

The chunks near the player are loaded first, and then the ones farther away (up to chunk_render_distance). I also decided to use a stack instead of a queue to determine which chunks to create first, because if the player moves very quickly to a new region, those new chunks must be created and rendered first
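
A rough sketch of the "newest request first, at most Y per frame" idea, assuming pending chunk indices sit in a `std::vector` used as a stack. `take_chunks_for_this_frame` and `budget` are made-up names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using chunk_index_t = std::uint64_t;

// Pops at most `budget` requests from the back of the stack (most recently
// requested first) and returns them as the batch to create this frame.
inline std::vector<chunk_index_t> take_chunks_for_this_frame(std::vector<chunk_index_t> &pending_stack,
                                                             std::size_t budget)
{
    std::vector<chunk_index_t> batch;
    while (!pending_stack.empty() && batch.size() < budget)
    {
        batch.push_back(pending_stack.back());
        pending_stack.pop_back();
    }
    return batch;
}
```

Because the newest requests sit on top, chunks around the player's latest position are created before stale requests from a region the player already left.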

fallen ocean
#

A small optimization: if any face of a voxel cube is covered by another voxel cube, create the vertex buffer in such a way that those faces' triangles are not rendered.
This simple optimization makes moving through the world "not be very laggy"
In this photo, only 1 in 5 voxel 'cubes' is set active, in order to not have the image be a large wireframe mess.
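
The hidden-face check can be sketched like this, assuming a dense boolean occupancy array per chunk and treating everything outside the chunk as empty. The types and the tiny `CHUNK_EDGE` are placeholders, not the engine's real layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr int CHUNK_EDGE = 4; // hypothetical chunk edge length in voxels

struct chunk_t
{
    std::array<bool, CHUNK_EDGE * CHUNK_EDGE * CHUNK_EDGE> solid{};

    static std::size_t flat_index(int x, int y, int z)
    {
        return static_cast<std::size_t>(x + CHUNK_EDGE * (y + CHUNK_EDGE * z));
    }

    // Voxels outside the chunk are treated as empty in this sketch.
    bool is_solid(int x, int y, int z) const
    {
        if (x < 0 || y < 0 || z < 0 || x >= CHUNK_EDGE || y >= CHUNK_EDGE || z >= CHUNK_EDGE)
        {
            return false;
        }
        return solid[flat_index(x, y, z)];
    }
};

// Counts the faces a naive mesher would emit: one per solid voxel face
// whose neighbor in that direction is empty.
inline int count_visible_faces(const chunk_t &chunk)
{
    constexpr int offsets[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    int faces = 0;
    for (int z = 0; z < CHUNK_EDGE; ++z)
        for (int y = 0; y < CHUNK_EDGE; ++y)
            for (int x = 0; x < CHUNK_EDGE; ++x)
            {
                if (!chunk.is_solid(x, y, z))
                {
                    continue;
                }
                for (const auto &o : offsets)
                {
                    if (!chunk.is_solid(x + o[0], y + o[1], z + o[2]))
                    {
                        ++faces;
                    }
                }
            }
    return faces;
}
```

A lone voxel yields 6 faces, while two touching voxels yield 10 instead of 12, since the shared pair of faces is skipped.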

fallen ocean
#

A video showing current progress. Next step -> Move the chunks that are loaded but covered by all sides by other chunks into unloaded chunks array.
(Unloaded chunks are simply not rendered, memory is still allocated which sometimes causes program to crash. This will be solved when some serialization logic is implemented)

#

what a crisp video thanks to compression froge_love

fallen ocean
#

Started working on procedural world gen, but discovered a few bugs along the way (looks like Z fighting, chunk loading is a bit strange at random times)

fallen ocean
#

Long time since I've posted here, but I'm currently working on CPU frustum culling. Have most of the code done, just have to fix some calculation issues, after which I'll post a screenshot here

fallen ocean
#

Followed this blog : https://fgiesen.wordpress.com/2012/08/31/frustum-planes-from-the-projection-matrix/ to implement cpu frustum culling by extracting data from the projection matrix. Not 100% working as I'm facing lots of skill issues, will have to go over it again
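
For reference, a small sketch of the plane extraction that blog describes, written for the column-vector convention (clip = M * v) and the D3D clip volume (-w <= x <= w, -w <= y <= w, 0 <= z <= w). The `matrix_t` layout and `make_perspective` helper are assumptions for illustration; with a row-vector convention such as DirectXMath's, the same sums apply to columns instead of rows.

```cpp
#include <array>
#include <cassert>

struct float4_t
{
    float x, y, z, w;
};
using matrix_t = std::array<float4_t, 4>; // four rows

inline float4_t plane_add(float4_t a, float4_t b) { return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}; }
inline float4_t plane_sub(float4_t a, float4_t b) { return {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w}; }

// Each plane is (a, b, c, d); a point is inside when a*x + b*y + c*z + d >= 0.
inline std::array<float4_t, 6> extract_frustum_planes(const matrix_t &m)
{
    return {{
        plane_add(m[3], m[0]), // left:   x >= -w
        plane_sub(m[3], m[0]), // right:  x <=  w
        plane_add(m[3], m[1]), // bottom: y >= -w
        plane_sub(m[3], m[1]), // top:    y <=  w
        m[2],                  // near:   z >=  0
        plane_sub(m[3], m[2]), // far:    z <=  w
    }};
}

inline bool is_point_inside(const std::array<float4_t, 6> &planes, float x, float y, float z)
{
    for (const float4_t &p : planes)
    {
        if (p.x * x + p.y * y + p.z * z + p.w < 0.0f)
        {
            return false;
        }
    }
    return true;
}

// Left-handed perspective projection looking down +z, depth mapped to [0, 1].
inline matrix_t make_perspective(float focal_length, float aspect, float near_z, float far_z)
{
    const float a = far_z / (far_z - near_z);
    return {{
        {focal_length / aspect, 0.0f, 0.0f, 0.0f},
        {0.0f, focal_length, 0.0f, 0.0f},
        {0.0f, 0.0f, a, -near_z * a},
        {0.0f, 0.0f, 1.0f, 0.0f},
    }};
}
```

Passing only the projection matrix gives view-space planes; passing projection times view gives world-space planes, which is usually what chunk culling wants.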

However, even with whatever partial culling is happening, I can finally traverse a large scene without much issue. Loading chunks is causing a lot of lag, have to figure that out

#

This project is uncovering a lot of skill issues I had, will have to go through the entire code base and try finding out why it's so slow

fallen ocean
#

About 25% done with the project re-write. Last time I made things too unnecessarily complex, instead of trying various approaches (to say voxel storage and rendering).

Trying to fix the issues with the project this time. Currently got imgui, bindless rendering in the project, will next try to work on a greedy meshing algorithm

meager marlin
#

imo you should start with "naive" meshing

#

it's simple and doesn't have the annoying edge cases (T-junctions, when to split quads, UVs) like greedy meshing

#

if you need to reduce the number of vertices after that, then you can look at greedy meshing

fallen ocean
meager marlin
#

ah

meager marlin
#

I can't tell if it's applied per face

fallen ocean
#

It is

#

I suppose that naive meshing was good enough, but the project was still really laggy and very messy
As you suggested I will re-implement naive meshing and then make the code more straightforward and simple, get frustum culling, multithreaded voxel loading and see how well it performs

#

If it is still slow then, I will go for greedy meshing and stuff

meager marlin
#

I'd first suggest profiling to see why it was laggy

#

Multithreading the creation of chunk meshes means that your meshing algorithm can be pathetically slow and still not cause noticeable hitches for the user

#

It's kinda hard to do without shooting yourself in the feet though

fallen ocean
# meager marlin I'd first suggest profiling to see why it was laggy

Tbh if I stayed at the same place and rotated the camera, the FPS was fine (so rendering wasn't an issue)
The problem was loading / unloading chunks each frame, and creation of chunks was also slow. I was using very small chunks (16x16x16 where each voxel has a length of 1, which is maybe why it was so performance heavy?)

meager marlin
#

I think it'll stutter a lot with no multithreading, no matter the algorithm

#

Unless you do like 1 chunk per frame

fallen ocean
#

That (1 chunk per frame to load) didn't stutter but obviously loading chunks around the player took 2-3 business days, so have to get some simple multithreading done

#

I've used async / future before and found it very nice to use, will probably try that out

meager marlin
#

hm

#

might be a nice excuse to try out c++20 coroutines

#

you can have a thread pool of coroutine schedulers (you'd need two libraries for this)

#

or you can shrimply use a thread pool and some concurrent queues

fallen ocean
fallen ocean
#

So far I've re-implemented naive meshing. Was in the process of making a multithreaded chunk loading system (using async/future first, and then once that works, moving on to a coroutine-based thread pool).
Also, I just came across this : D3D12_HEAP_TYPE_GPU_UPLOAD and honestly think it can make my life much easier, so going to try that out

fallen ocean
#

My GPU doesn't support it 😦 so back to using upload heaps

fallen ocean
#

Async copy with a multi-threaded chunk loading system (using std::future/async) to load chunks with no lag! (using 2 different command queues, one for copy and one for direct)
Note that each cube in the video is a chunk with very small voxels. I'm using a sort of hacky mutex / condition_variable approach so have to review the code, but after I'm sure the program is correct, will work on a cpp20 - co-routine based thread pool
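
A stripped-down sketch of the std::async / std::future part, assuming the main loop polls the futures each frame instead of blocking on them. `build_chunk_mesh` is a stand-in for the real meshing and upload work, and the copy-queue side is left out entirely.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

struct chunk_mesh_t
{
    int chunk_index;
    std::size_t vertex_count;
};

// Placeholder for the real (slow) meshing work.
inline chunk_mesh_t build_chunk_mesh(int chunk_index)
{
    return {chunk_index, 36u * static_cast<std::size_t>(chunk_index + 1)};
}

inline std::future<chunk_mesh_t> request_chunk(int chunk_index)
{
    return std::async(std::launch::async, build_chunk_mesh, chunk_index);
}

// Called once per frame: moves every finished mesh into `loaded` without
// blocking. Futures that are not ready yet stay in `in_flight`.
inline void collect_finished(std::vector<std::future<chunk_mesh_t>> &in_flight, std::vector<chunk_mesh_t> &loaded)
{
    for (auto it = in_flight.begin(); it != in_flight.end();)
    {
        if (it->wait_for(std::chrono::seconds(0)) == std::future_status::ready)
        {
            loaded.push_back(it->get());
            it = in_flight.erase(it);
        }
        else
        {
            ++it;
        }
    }
}
```

Since only the main thread touches `loaded`, this shape needs no mutex around the loaded-chunk container itself.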

#

When I move the camera there are some slight artifacts (feels exaggerated with game capture), but I will try to fix that up soon

fallen ocean
#
    for (size_t k = 0; k < chunk_manager.m_loaded_chunks.size(); k++)
    {
        const size_t i = chunk_manager.m_loaded_chunks[k].m_chunk_index;

        const VoxelRenderResources render_resources = {
            .position_buffer_index = static_cast<u32>(chunk_manager.m_chunk_position_buffers[i].srv_index),
            .color_buffer_index = static_cast<u32>(chunk_manager.m_chunk_color_buffers[i].srv_index),
            .chunk_index = static_cast<u32>(i),
            .scene_constant_buffer_index = static_cast<u32>(scene_buffer.cbv_index),
        };

        command_list->SetGraphicsRoot32BitConstants(0u, 64u, &render_resources, 0u);
        command_list->DrawInstanced((u32)chunk_manager.m_chunk_number_of_vertices[i], 1u, 0u, 0u);
    }
#

This has to be the most hidden bug I have ever faced 🫠 took me a while to find what it was, but once you see it you can't unsee it
In the image above, I load chunks from indices 1 -> N. The above demo hit a use case for which the above loop works fine

meager marlin
#

I'm guessing you put k where an i should've been or vice versa

#

those variable names are footgunny

fallen ocean
# meager marlin I'm guessing you put `k` where an `i` should've been or vice versa

No, loaded chunks is a hashmap and not a vector (even I forgot about this tbh)
Say the current (and only) chunk loaded has chunk index 100
First k = 0, and because of m_loaded_chunks[0], a new chunk at index 0 is created and size of loaded chunks becomes 1
Then when k = 1, size of loaded chunks array becomes 2, a new chunk is created for loaded_chunks[1] and so on

So basically chunks were being created rapidly and the chunk index was always 0, and I couldn't even see the new chunks being loaded since they were all at the SAME position

#

So when I wanted to render a single chunk, I would render a million chunks but they all had the same position, which really makes the program lag

#

I switched to a range based for loop for the hash map (loaded_chunks), and now the program does not run at 15 fps frog_dum
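
The behaviour behind this bug is that `operator[]` on a `std::unordered_map` default-constructs and inserts an entry when the key is missing. A small reproduction, with a made-up `chunk_t` and an iteration cap so the index loop terminates:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

struct chunk_t
{
    std::size_t m_chunk_index = 0;
};
using chunk_map_t = std::unordered_map<std::size_t, chunk_t>;

// Index-style loop over a hash map. Every chunks[k] with a missing key
// inserts a default chunk (index 0), so size() keeps growing. The cap
// exists only so this demonstration stops.
inline std::size_t visit_by_index(chunk_map_t &chunks, std::size_t max_iterations)
{
    std::size_t visited = 0;
    for (std::size_t k = 0; k < chunks.size() && k < max_iterations; ++k)
    {
        (void)chunks[k].m_chunk_index;
        ++visited;
    }
    return visited;
}

// Range-based loop: visits only the entries that actually exist.
inline std::size_t visit_by_range(const chunk_map_t &chunks)
{
    std::size_t visited = 0;
    for (const auto &entry : chunks)
    {
        (void)entry;
        ++visited;
    }
    return visited;
}
```

Using `find` or `at` instead of `operator[]` is another way to make an accidental insert impossible on a read path.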

fallen ocean
#

Not entirely sure how my program fails at rendering cubes, will have to figure this out soon
I really hope it's not some error in the copy queue implementation

#

The bottom right blue cube in renderdoc's VS output is clearly wrong, so some calculation is being messed up

fallen ocean
#

I implemented reverse z, and all the above artifacts are gone 🙂

#

I think what has happened is that with so much precision given to objects very close to the near plane, when moving a bit away from the camera (even a little), even on the same flat plane, the z values after the projection transform vary a lot

#

I did a check in renderdoc and literally for the above image the depth buffer had values in the range of [0.9999 and 1]. With reverse z, I made minor changes to the code:
(i) Reversed near and far plane when providing input to projection matrix function
(ii) clear value of depth buffer to 0.0 instead of 1.0
(iii) comparison func from Less than -> Now it is greater than

In RenderDoc, for a case like the one above, depth values range from [0.4 to ~0.1]. Now all the precision is not taken by objects very close to the near plane, and visual artifacts are vastly reduced
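
To put numbers on this, here is a sketch of the depth a D3D-style [0, 1] perspective projection writes for a given view-space distance, and the same function with near and far swapped as in step (i). The plane values used in the note below are invented, not the engine's.

```cpp
#include <cassert>
#include <cmath>

// Depth in [0, 1] for view-space distance z with a standard projection:
// depth = a - (near * a) / z, where a = far / (far - near).
inline double projected_depth(double z, double near_plane, double far_plane)
{
    const double a = far_plane / (far_plane - near_plane);
    return a - (near_plane * a) / z;
}

// Reverse Z: the same formula with near and far swapped, so the near plane
// maps to 1 and the far plane maps to 0.
inline double reversed_depth(double z, double near_plane, double far_plane)
{
    return projected_depth(z, far_plane, near_plane);
}
```

With near = 0.1 and far = 1000, a point 50 units away gets a standard depth of about 0.998 and a reversed depth of about 0.002. As far as I know, the precision gain comes from pairing this with a floating-point depth format, where representable values are densest near 0.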

fallen ocean
#

Did a small profiling for sanity check, but the app slow down after loading a lot of chunks is indeed due to very high VPC SOL. Next step is to get indirect rendering + GPU culling

fallen ocean
#

I got (CPU clip space culling) and indirect rendering setup, but a major problem still exists (14+ms of .... nothing?)

meager marlin
fallen ocean
#

Probably. I will check in the VS profiler to see what is messing with the FPS

fallen ocean
#

Loaded up nsight systems and profiled the renderer, going to learn this tool, seems useful
I saw in this talk (https://www.gdcvault.com/play/1026202/Optimizing-DX12-DXR-GPU-Workloads) that over-subscribing GPU memory can be an issue, and by coincidence my renderer starts to lag when the number of chunks is very large (> 50K)

fallen ocean
#

Very confusing, since now nsight systems reports that 6ms (out of 22) is because of CopyResource (for indirect rendering, command buffer for now is copied each frame from upload heap to default heap), and in the remaining 18ms... VPC has the highest throughput, but I already do culling?

#

Anyway, by tomorrow / Tuesday I will switch to doing culling on the GPU and avoid the use of CopyResource, and then see what has to be done for the VPC throughput

fallen ocean
#

Implemented GPU compute culling, now the FPS has gone from 60 to 16 💀 safe to say something is going wrong

fallen ocean
#

Seems like I forgot to clear the counter. Will read more on UAV atomics and hopefully get this done by tomorrow

fallen ocean
#

For some reason my camera is all buggy (only for rotation). Going to spend today re-learning the math and figure out what's wrong
The only issue is: I've used the same camera class for 2 other renderers and things are working fine in those projects... Not exactly sure what's causing this issue

fallen ocean
#

Issue fixed, had nothing to do with the camera code but actually with precision issues 😦
If each voxel has a small length (between 1 and 16), camera rotation is fine. In the video above I had it set to 640.0f, which I suppose caused the issue

fallen ocean
#

As an optimization, I decided to remove position buffers (dedicated for each chunk) and rather have a single position buffer, but each chunk now has a dedicated index buffer

#

So basically rather than a float3 (12 bytes), the index buffer has the same number of elements but each element is a u16 (2 bytes)

However, this requires changing the index buffer per draw WITH indirect rendering, due to which I cannot enable GPU-based validation. Also, the program crashes if I use indirect rendering (with this index buffer), but not with normal rendering 🤔
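
One way to picture the float3-to-u16 change described above: if the shared position buffer holds every corner of a chunk-sized lattice once, each chunk only needs 16-bit indices into it. This layout is my guess at the scheme, not something the thread confirms, and the 8-voxel edge is a placeholder.

```cpp
#include <cassert>
#include <cstdint>

constexpr int VOXELS_PER_EDGE = 8; // hypothetical chunk edge length in voxels
constexpr int CORNERS_PER_EDGE = VOXELS_PER_EDGE + 1;
constexpr int CORNER_COUNT = CORNERS_PER_EDGE * CORNERS_PER_EDGE * CORNERS_PER_EDGE;

// Every corner index must be representable as a u16 for this to work.
static_assert(CORNER_COUNT <= 65536, "corner indices must fit in a u16");

// Index of lattice corner (x, y, z) inside the shared position buffer.
constexpr std::uint16_t corner_index(int x, int y, int z)
{
    return static_cast<std::uint16_t>(x + CORNERS_PER_EDGE * (y + CORNERS_PER_EDGE * z));
}

// Per-vertex storage cost before and after the change.
constexpr int BYTES_PER_FLOAT3_POSITION = 12;
constexpr int BYTES_PER_U16_INDEX = 2;
```

Under this assumption the per-chunk vertex data shrinks by a factor of 6, and the largest index for an 8x8x8 chunk is 728.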

fallen ocean
#

I fixed the issue. Basically, on the CPU, because of alignment rules, the indirect command had implicit padding which was not reflected in the SRV and UAV byte stride. By adding a uint for padding (on both the CPU and GPU indirect command struct), the program no longer crashes

fallen ocean
#

This is probably going to take a while to implement, but I am going to work on a circular ring buffer for chunks (both GPU and CPU)
Right now you fly around the world for a few minutes and allocate like 8GB memory, so instead I'll pre-allocate CPU and GPU mem for 10K chunks and re-use that memory for new chunks

#

I'm saying 10K chunks because my chunks are basically the size of a single minecraft voxel. Due to some precision issues I'm not able to make the voxels the same size as minecraft chunks. Maybe I should work on this first

fallen ocean
#

fp precision issues are now vastly reduced... all I did was few minor things:
(i) camera position in view matrix is hardcoded to zero. In shader, when computing vertex position I subtract the camera position from the vertex position before view/projection transform
(ii) N buffers for view projection matrix (never had to use this, but made significant improvements)
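
A CPU-side sketch of why point (i) helps, reduced to a 2D rotation about the Y axis. The conventional path rotates two large world-space values and then subtracts them, while the camera-relative path subtracts first and rotates a small value. All names here are illustrative.

```cpp
#include <cassert>
#include <cmath>

struct float2_t
{
    float x, z;
};

// Rotation about the Y axis, restricted to the x/z components.
inline float2_t rotate(float2_t v, float angle)
{
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    return {c * v.x + s * v.z, -s * v.x + c * v.z};
}

// Conventional view transform: R * vertex + (-R * camera). Both terms are
// large far from the origin, so their rounding errors survive the subtraction.
inline float2_t view_space_conventional(float2_t vertex, float2_t camera, float angle)
{
    const float2_t rotated_vertex = rotate(vertex, angle);
    const float2_t rotated_camera = rotate(camera, angle);
    return {rotated_vertex.x - rotated_camera.x, rotated_vertex.z - rotated_camera.z};
}

// Camera-relative transform: R * (vertex - camera). The subtraction happens
// while the values are still close together, so only a small vector is rotated.
inline float2_t view_space_camera_relative(float2_t vertex, float2_t camera, float angle)
{
    return rotate({vertex.x - camera.x, vertex.z - camera.z}, angle);
}
```

Comparing both functions against a double-precision reference for a camera around (100000, 100000) is a quick way to see the difference on your own machine.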

#

[67K chunks (of 8x8x8 voxels) being rendered in the above video, good to know that the renderer can handle large chunk counts]

fallen ocean
#

This doesn't look too good, but not sure if this is normal

#

RtlUserThreadStart uses ~94% of CPU. Maybe getting a thread pool integrated where I can reuse threads without creating a new one / restarting an old one will solve this issue

#

ntdll (which has thread related stuff inside) takes more CPU% than my actual engine, and in the engine code also all the threading related functions take a lot of CPU

fallen ocean
#

Never realized how expensive the overhead of a thread is

fallen ocean
#

Making some more changes for this thread pool logic, as rn all threads share a single command list via mutexes
The end goal is for each chunk loading thread to have its own command list, and each thread has a queue of command allocators

fallen ocean
#

How has the control flow entered that if loop what

fallen ocean
#

I've decided to put this project on hold, and learn systems programming & basics of gp-gpu first before continuing on this project

No particular reason, but I think I still have a lot of missing fundamental knowledge, will try learning that before continuing on this project. At least I got to learn about using multiple queues (and got a small blog post on that topic) from this project

fallen ocean
#

Totally forgot I had put some stuff over here (thanks to Deccer for the reminder 🙂)

I restarted the project, and implemented a custom thread pool in C++
Was a really simple thing to do, and learnt about packaged task, async / future
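
For anyone curious what such a pool can look like, here is a minimal sketch built on `std::packaged_task`. It is not the engine's actual pool; `thread_pool_t` and `submit` are names made up for this example, and it only handles callables that take no arguments.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

class thread_pool_t
{
  public:
    explicit thread_pool_t(std::size_t thread_count)
    {
        for (std::size_t i = 0; i < thread_count; ++i)
        {
            m_workers.emplace_back([this] { worker_loop(); });
        }
    }

    // Lets the workers finish every queued job, then joins them.
    ~thread_pool_t()
    {
        {
            std::scoped_lock lock(m_mutex);
            m_stopping = true;
        }
        m_wake.notify_all();
        for (std::thread &worker : m_workers)
        {
            worker.join();
        }
    }

    // Queues a zero-argument callable and returns a future for its result.
    template <typename F> auto submit(F &&function) -> std::future<std::invoke_result_t<F>>
    {
        using result_t = std::invoke_result_t<F>;
        auto task = std::make_shared<std::packaged_task<result_t()>>(std::forward<F>(function));
        std::future<result_t> future = task->get_future();
        {
            std::scoped_lock lock(m_mutex);
            m_jobs.emplace([task] { (*task)(); });
        }
        m_wake.notify_one();
        return future;
    }

  private:
    void worker_loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stopping || !m_jobs.empty(); });
                if (m_jobs.empty())
                {
                    return; // stopping and nothing left to run
                }
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job(); // run outside the lock so other workers can dequeue
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::queue<std::function<void()>> m_jobs;
    std::vector<std::thread> m_workers;
    bool m_stopping = false;
};
```

The threads are created once and reused for every job, which avoids the per-task thread start-up cost that showed up in the profiler earlier in the thread.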

#

Now I'm trying to rewrite the code to make sense, and start fixing memory related issues (in previous builds, when new chunks are being created, the entire app slows down because of the massive number of ID3D12Resource objects being created)

#

So the current plan is simple: create a 500 MB heap, and have new chunks reuse the resource of an evicted chunk, with just the chunk-related values being updated. Will try to post more updates over here as I get stuff done
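
The bookkeeping for that plan can be sketched as a fixed set of equally sized slots in one big allocation, with a free list so an evicted chunk's slot is handed to the next new chunk. `chunk_slot_pool_t` and its fixed slot size are assumptions for illustration; nothing here touches D3D12.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

class chunk_slot_pool_t
{
  public:
    chunk_slot_pool_t(std::size_t slot_count, std::size_t bytes_per_slot) : m_bytes_per_slot(bytes_per_slot)
    {
        m_free_slots.reserve(slot_count);
        // Push in reverse so slot 0 is handed out first.
        for (std::size_t i = slot_count; i > 0; --i)
        {
            m_free_slots.push_back(i - 1);
        }
    }

    // Returns the byte offset of a free slot inside the big buffer,
    // or std::nullopt when every slot is in use.
    std::optional<std::size_t> acquire()
    {
        if (m_free_slots.empty())
        {
            return std::nullopt;
        }
        const std::size_t slot = m_free_slots.back();
        m_free_slots.pop_back();
        return slot * m_bytes_per_slot;
    }

    // Called when a chunk is evicted; its slot becomes available again.
    void release(std::size_t byte_offset)
    {
        m_free_slots.push_back(byte_offset / m_bytes_per_slot);
    }

  private:
    std::size_t m_bytes_per_slot;
    std::vector<std::size_t> m_free_slots;
};
```

Memory use stays constant because no allocation happens after construction; running out of slots becomes an explicit case the caller has to handle, for example by evicting the farthest chunk first.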

fallen ocean
#

New bug alert : My camera doesn't move when the game is built in release mode
and nsight and renderdoc crash (even before frame is captured)

what kind of bugs are these froge_bleak

fallen ocean
#

I found the keyboard issue, this is a bit embarrassing 🤦‍♂️

I had this assert #define

#

and GetKeyboardState was called inside an assert, so in release mode the function was never being called at all
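
A tiny reproduction of that class of bug. The actual #define from the screenshot is not in the thread, so `RELEASE_STYLE_ASSERT` below only mimics what an assert-style macro commonly expands to in a release build, and `poll_keyboard_state` stands in for GetKeyboardState.

```cpp
#include <cassert>

// In release builds an assert-style macro typically discards its argument
// without evaluating it.
#define RELEASE_STYLE_ASSERT(expression) ((void)0)

inline int g_poll_count = 0;

// Stand-in for a call with a required side effect.
inline bool poll_keyboard_state()
{
    ++g_poll_count;
    return true;
}

// Buggy: the call lives inside the macro argument, so it vanishes.
inline void update_input_buggy()
{
    RELEASE_STYLE_ASSERT(poll_keyboard_state());
}

// Fixed: do the work first, then assert on the stored result.
inline void update_input_fixed()
{
    const bool succeeded = poll_keyboard_state();
    RELEASE_STYLE_ASSERT(succeeded);
    (void)succeeded; // avoids an unused-variable warning when the macro is empty
}
```

The same rule applies to the standard `assert`: anything with a side effect should be evaluated outside the macro.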

limber smelt
#

are you still writing C like a cave man?

#

or have you switched to c++ yet?

fallen ocean
#

I used a custom assert function mostly because of this:

The definition of the macro assert depends on another macro, NDEBUG, which is not defined by the standard library.
But fair point, I'll see what is the more modern C++ alternative for this (inline function + constexpr maybe?)

meager marlin
#

invoking ub as an assert is not a good idea as the compiler can remove it in release mode

#

you could invoke the equivalent of __debugbreak and then exit or abort

fallen ocean
#

It would have prevented the issue where the function was not being called, as the function param is a bool, and the compiler won't remove UB in release mode (at least I think so?)
Plus, for now I'll just use __debugbreak as it's available in MSVC

limber smelt
#

this looks cursed

#

just include <cassert>

#

those snippet above look like some gpt output or some copy pasta from some cpilled kid which does c++ but doesnt want to do c++ on stack overflow

#

__debugbreak works but is also not portable, but i guess you still just target msvc/d3d12?

meager marlin
#

it's easy enough to port to other environments

#

I think this function needs to abort on fail too

fallen ocean
#

well the intention was to not rely on #defs that may or may not be present, but that shouldn't be an issue realistically

I'll switch to using normal assert itself

limber smelt
#

yeah, the other stuff is more weird : )

#

i use someone's portable debug_break because im not targeting c++26 yet

meager marlin
limber smelt
#

assert will throw an exception as the default no?

#

that ugly message box

meager marlin
#

normal assert ostensibly isn't viable

fallen ocean
limber smelt
#

yeah that one

fallen ocean
#

I've never done actual error handling / exception stuff before. I will start working on this once I get that fancy single heap memory idea done, because after that I plan on restructuring the code properly

#

I think I found the reason execute indirect was causing nsight and renderdoc to crash. I have 2 different definitions of indirect command, one in HLSL and one in C++, and they are causing a LOT of problems for me because of implicit alignment and padding.

on C++ side:


```cpp
    struct indirect_command_t
    {
        interop::voxel_render_resources_t render_resources{};
        D3D12_INDEX_BUFFER_VIEW index_buffer_view{}; // alignment of 8
        D3D12_DRAW_INDEXED_ARGUMENTS draw_arguments{};
    };
    static_assert(sizeof(interop::gpu_indirect_command_t) == sizeof(indirect_command_t));
```

On HLSL side:
```cpp
    struct gpu_indirect_command_t
    {
        voxel_render_resources_t voxel_render_resources;
        uint padding;
        uint4 index_buffer_view;
        uint4 draw_arguments_1;
        uint draw_arguments_2;
        uint padding2;
    };```
I fixed it by using #pragma pack(push, 4) on the C++ side. Plus, I have a static_assert that the sizes of the CPU and GPU indirect commands match, hopefully mitigating such issues in the future
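
The effect of the implicit padding can be checked at compile time. The sketch below uses stand-in structs with the same sizes and alignment as the D3D12 types (a 64-bit address plus two 32-bit fields for the index buffer view, five 32-bit values for the draw arguments). The 20-byte `render_resources_t` is a guess, and the numbers assume a typical x64 ABI where a 64-bit integer is 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins: only size and alignment matter for this demonstration.
struct render_resources_t
{
    std::uint32_t indices[5]; // 20 bytes, hypothetical
};
struct index_buffer_view_t
{
    std::uint64_t buffer_location; // forces 8-byte alignment
    std::uint32_t size_in_bytes;
    std::uint32_t format;
};
struct draw_indexed_arguments_t
{
    std::uint32_t values[5]; // 20 bytes
};

// Default layout: the compiler inserts 4 bytes before `view` and 4 at the end.
struct padded_command_t
{
    render_resources_t resources;
    index_buffer_view_t view;
    draw_indexed_arguments_t draw;
};

// Packed to 4 bytes: no implicit padding, matching a tightly packed GPU struct.
#pragma pack(push, 4)
struct packed_command_t
{
    render_resources_t resources;
    index_buffer_view_t view;
    draw_indexed_arguments_t draw;
};
#pragma pack(pop)

static_assert(offsetof(padded_command_t, view) == 24, "4 bytes of padding before the view");
static_assert(sizeof(padded_command_t) == 64, "rounded up to a multiple of 8");
static_assert(offsetof(packed_command_t, view) == 20, "no padding when packed");
static_assert(sizeof(packed_command_t) == 56, "sum of the member sizes");
```

Asserting on `offsetof` for each member, in addition to the total size, also catches layouts where the sizes happen to match but the padding sits in different places.
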
fallen ocean
#

Some metrics for reference (current application frame time when thousands of chunks are being loaded at once) :

fallen ocean
#

Uhh so according to nsight systems only one CPU core is active? This doesn't look right to me at all

#

I can see 12 different thread id's, so now really confused as to why nsight says something else frog_thinkk

A screenshot from my thread pool code and console output:

fallen ocean
#

Sorry for the bajillion screenshots, but here's the culprit: each frame is filled with CreateCommittedResource calls, so the app is heavily CPU bottlenecked

fallen ocean
#

Making slow progress
Each chunk just has 'offsets' into the color / index buffer data, pointing into the actual (very large) committed resources that the chunk manager holds

This means RAM / VRAM usage is constant. Still have a lot of issues (I'm doing memcpy / CopyResource for the entire buffer each frame) + there are some index-related bugs, which I am working on right now

fallen ocean
#

I tried having the large buffers be upload buffers (to avoid multiple copy resources per frame), but something looks fishy

#

I am not sure if this means it takes 30ms between command list close and reset on the CPU side? That doesn't look right to me

fallen ocean
#

Time to get tracy implemented, something is really wrong on the CPU side (and I've wanted to try out tracy for a long time, so it should be a good thing to setup)

fallen ocean
#

Oh hell nah. 32ms to fill a buffer on the cpu side 💀

#

Quickest optimization fix ever KEKW
Now when new chunks are loaded, I directly update the upload buffers with the chunk indices data
Now rendering is at a smooth 60FPS even when bajillion chunks are loading

Still some offset related issue, need to figure out why that is happening

#

Still something wrong with the cpu side : memory usage shoots up wayy too rapidly.
Need to have a constant allocation like what I do on the GPU side

here is a screencap with latest optimization updates :)

fallen ocean
#

```cpp
{
    ZoneScopedN("Chunk eviction");
    std::vector<voxel_chunk_position_t> keys_of_chunks_to_unload{};
    std::erase_if(chunk_manager.m_loaded_chunks,
                  [current_chunk_3d_index, &chunk_manager, &keys_of_chunks_to_unload](
                      const std::pair<const voxel_chunk_position_t, voxel_chunk_t> &key_value_pair) {
                      auto &chunk = key_value_pair.second;

                      if ((std::abs(chunk.m_chunk_position.x - current_chunk_3d_index.x) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2) ||
                          (std::abs(chunk.m_chunk_position.y - current_chunk_3d_index.y) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2) ||
                          (std::abs(chunk.m_chunk_position.z - current_chunk_3d_index.z) >=
                           CHUNK_RENDER_DISTANCE_PER_DIMENSION / 2))
                      {
                          keys_of_chunks_to_unload.push_back(key_value_pair.first);
                          return true;
                      }

                      return false;
                  });
    if (!keys_of_chunks_to_unload.empty())
    {
        std::scoped_lock<std::mutex> scoped_lock(chunk_manager.m_unloaded_chunk_queue_mutex);
        for (const auto &key : keys_of_chunks_to_unload)
        {
            auto it = chunk_manager.m_loaded_chunks.find(key);
            if (it != chunk_manager.m_loaded_chunks.end()) // check added just for my sanity
            {
                chunk_manager.m_unloaded_voxel_chunks.emplace(std::move(it->second));
            }
        }
    }
}
```
#

This small piece of code is causing voxel_chunk_t constructor to be called, which internally allocates memory for a new chunk (heap allocated array of voxels).
Not sure how and where the constructor is being called, as it feels really strange
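
One thing that stands out in the snippet, whether or not it explains the constructor call: `std::erase_if` has already removed the matching entries by the time the second loop runs, so `find` can never succeed there and nothing reaches the unloaded queue. A sketch of an alternative that moves each chunk out before erasing its node, reduced to a 1D position and plain containers (both simplifications are mine):

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>
#include <utility>
#include <vector>

struct voxel_chunk_t
{
    int x = 0;                         // 1D position to keep the sketch short
    std::vector<unsigned char> voxels; // stand-in for the heap-allocated voxel array
};

// Moves every chunk at or beyond `max_distance` from `current_x` into
// `unloaded`, then erases its (now moved-from) entry from `loaded`.
inline void evict_far_chunks(std::unordered_map<int, voxel_chunk_t> &loaded, std::vector<voxel_chunk_t> &unloaded,
                             int current_x, int max_distance)
{
    for (auto it = loaded.begin(); it != loaded.end();)
    {
        if (std::abs(it->second.x - current_x) >= max_distance)
        {
            unloaded.push_back(std::move(it->second)); // steal the data before the node dies
            it = loaded.erase(it);                     // erase returns the next valid iterator
        }
        else
        {
            ++it;
        }
    }
}
```

If the chunk type lacks a usable move constructor, `std::move` silently falls back to a copy, which is one place an unexpected allocation could come from.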

tough pier
fallen ocean
#

This doesn't make sense what
Memory usage when I load chunks, don't move camera, and trivy is connected

#

Memory usage when trivy isn't connected for the same scenario (it never stops growing?)

meager marlin
#

what's trivy?

#

tracy?

#

tracy keeps a growing buffer of information to send to the client that it expects will be eventually connected

#

you can disable this via some define, but it will make any data prior to the client being connected not appear

fallen ocean
#

my bad meant to say tracy, trivy is a different tool

fallen ocean
meager marlin
#

that's also what I'm describing

#

the client consumes the info but in this case there's no client connected so it just keeps accumulating in your app

fallen ocean
#

Ahh I see what you mean now

meager marlin
fallen ocean
#

I added the TRACY_ON_DEMAND macro, things work as expected now
Thought for a second that my code was really broken. It is, but not too much
Thanks for the help 🙂

meager marlin
#

I experienced this issue myself hehe

#

eventually tracked it down to having ZoneScoped; in a really hot function and then someone pointed this out to me

fallen ocean
#

I see.. had a question for you (since you also work on voxel stuff)
How much RAM does your process use? In my engine, when I load around 2k chunks, VS reports 1.4-1.5 GB of memory usage. Is this too high for a voxel engine?

#

I don't compress my data at all (each chunk -> 8x8x8 voxels, each voxel is 1 bit, each chunk (on cpu) stores an array of indices as u16 and a float3 color), but if this is way too high I might have to start changing things

meager marlin
#

what does task manager say when you run the exe outside the debugger?

#

vs also has a button to launch without the debugger

#

vs says 3gb but task manager says 1.3 gb

fallen ocean
#

Strange, VS says 1.4 but task manager reports 800mb

meager marlin
#

the extra stuff is probably debug structures that the debugger allocates

fallen ocean
#

Perhaps, I always thought debugging had very little overhead, but I suppose it's a lot more complicated than that

meager marlin
#

idk much about it either, I'm just making an inference

fallen ocean
#

ahhh stuck with some office shenanigans, so will probably get back to work on this engine sometime this week
Next plan : make all cpu side memory allocations static, followed by code that is well structured (minor refactor), error handling,etc

fallen ocean
#

Was able to make all CPU side allocations upfront.
However, now it's back to the strange performance issues: FPS drops to 30 when loading chunks, even though there are no memory allocations going on frog_dum

fallen ocean
#

Ah I see the issue now
Basically, I have a hashmap of key value pair of chunk position & index to a vector of preallocated chunks

In the renderloop, I iterate over the hashmap and render. However, in the async chunk meshing function, I have to access this hashmap. So, I added a mutex which causes the main loop to sometimes wait on the chunk meshing function

#

Need to figure out a solution to this -> either have a system where main loop has a high 'priority' and can unlock mutex at any point of time, or some other way where the main loop never has to wait on anything

#

Before I had a function to iterate over a queue of std::async's and load said chunks into the loaded chunk hashmap. Might have to do something similar this time too (even though the std::async doesn't actually return a value)
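
One common shape for this, sketched under the assumption that workers only need to hand finished chunks to the main thread: keep a small mutex-protected "pending" list and let the main loop swap it out, so the lock is held for a pointer swap rather than for a whole meshing job. `handoff_queue_t` is a made-up name.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

template <typename T> class handoff_queue_t
{
  public:
    // Worker threads: publish a finished item.
    void push(T item)
    {
        std::scoped_lock lock(m_mutex);
        m_pending.push_back(std::move(item));
    }

    // Main thread, once per frame: take everything published so far.
    // The lock covers only the swap, never the processing of the items.
    std::vector<T> drain()
    {
        std::vector<T> taken;
        {
            std::scoped_lock lock(m_mutex);
            taken.swap(m_pending);
        }
        return taken;
    }

  private:
    std::mutex m_mutex;
    std::vector<T> m_pending;
};
```

The main thread then inserts the drained chunks into its own hashmap, which no other thread touches, so the render loop never iterates a container that a worker might be modifying.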

fallen ocean
#

😦 I removed all mutexes and now use a queue of std::future, but something is still seriously wrong
maybe my idea of having 1000's of chunks loaded at the same time is wrong, or something else

When new chunks are being loaded, the app temporarily drops a couple FPS, which should not happen

Now I am at a point where the main thread itself is becoming a bottleneck: the code that checks whether a chunk has to be evicted takes tens of ms in the worst-case scenario

meager marlin
#

Btw you can create tracked mutexes in Tracy so you can see which threads are waiting on them

#

Sadly it makes the client crash on my machine for some reason (possibly I have too many threads, it only supports up to 64)

fallen ocean
meager marlin
#

Wait I have dementia you literally just said you removed them lol

fallen ocean
#

Apparently I am trying to unload 5 million chunks froghorror
There's like 6K chunks loaded, what on earth am I trying to unload

#

no wonder memory usage was so high despite having static mem allocs

#

Curious, is this a common programming paradigm or just bad code?
I am trying to not add items to my stacks / queue if they are already present, so I use an unordered set to get O(1) lookup times. Haven't seen this being used elsewhere so not sure how good this is
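
For concreteness, the pattern looks roughly like this: a stack for ordering plus an unordered set that mirrors its contents for membership checks. `unique_stack_t` is a name invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

class unique_stack_t
{
  public:
    // Returns false (and changes nothing) if the index is already queued.
    bool push(std::uint64_t chunk_index)
    {
        if (!m_present.insert(chunk_index).second)
        {
            return false;
        }
        m_items.push_back(chunk_index);
        return true;
    }

    // Removes and returns the most recently pushed index, if any.
    std::optional<std::uint64_t> pop()
    {
        if (m_items.empty())
        {
            return std::nullopt;
        }
        const std::uint64_t top = m_items.back();
        m_items.pop_back();
        m_present.erase(top); // keep the set in sync with the stack
        return top;
    }

    std::size_t size() const { return m_items.size(); }

  private:
    std::vector<std::uint64_t> m_items;
    std::unordered_set<std::uint64_t> m_present;
};
```

The cost is storing each index twice; wrapping both containers in one type keeps them from drifting out of sync.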

meager marlin
#

hmm

#

I can't think of a better way off the top of my head, so it's probably fine

meager marlin
fallen ocean
meager marlin
#

500us is still a long time

#

Is that just for the first frame

fallen ocean
meager marlin
#

ah, it needs to be faster then

fallen ocean
#

That function is called 5k times, is that even a large number for the CPU?

meager marlin
#

is this release mode

fallen ocean
#

Yeah, release with debug info
Let me check if just release mode changes anything

meager marlin
#

But I wonder if you could use a different data structure that doesn't force you to iterate over every chunk

#

Like you could use a 2D grid with modular indexing

fallen ocean
meager marlin
#

Actually this is basically the same as a problem we had when implementing virtual shadow maps

#

Which is how to deallocate edge pages when the light frustum moves

#

Actually I didn't deallocate anything in the end so nvm

fallen ocean
meager marlin
#

the "problem" is that you'd have to move every chunk around the grid whenever the player moved, which might be a little costly, though probably not too bad

meager marlin
#

I suppose I should also mention that you should only do any of this when the player moves into another chunk

meager marlin
#

Which is to literally have a grid representing chunks around the player and move the pointers around when the player moves

fallen ocean
#

Horrible compression artifacts but at least now I can fly around rapidly (each colored cube is an entire chunk) and not run out of memory

meager marlin
#

If the pointer falls off the grid, then delete the chunk (you can define a smaller area within the grid to allocate chunks so moving back and forth a small distance can't cause a chunk to repeatedly appear and disappear)

fallen ocean
#

As much as I want to improve the chunk eviction logic using a grid or something, I hate how the voxels look, so next up might be getting basic lighting, some procedural terrain gen, and then I'll come back to perf improvements

meager marlin
#

Yeah, you got your perf win already

#

Worry about the .5ms later lol

fallen ocean
#

Hehe true
I'm honestly surprised how many perf wins I can get by just not being stupid

Before, I used to wonder what my bottleneck was and try random optimizations, but now that I've got Tracy set up I know exactly what's wrong and what to fix first

fallen ocean
#

So I decided to finally get good, so working on this project again
On a completely unrelated side note, I really don't like ImGui's new dx12 init api.
You basically have to pass an alloc function that can be used to alloc new descriptors. However, the type definition is

 void (*SrvDescriptorAllocFn)(ImGui_ImplDX12_InitInfo* info, D3D12_CPU_DESCRIPTOR_HANDLE* out_cpu_desc_handle, D3D12_GPU_DESCRIPTOR_HANDLE* out_gpu_desc_handle);

#

No place for a 'this' ptr, so can't really use a member function for this.
Create a lambda func and pass it in? Nope, I'd need a capture list, and I don't think I can use that with the above func pointer type, because a lambda can't decay to a function pointer if it captures anything

meager marlin
#

you should raise an issue on gh if you think a user pointer would be helpful

#

alternatively, copy the backend into your source tree and fix it locally

fallen ocean
meager marlin
#

I did that with the vulkan backend and ended up changing quite a few things

fallen ocean
#

the old api is still usable, but I switched to vcpkg and the package there hasn't been updated, because of which imgui fails to initialize when I use the latest vcpkg package 😅

fallen ocean
#

Lmao ImGui_ImplDX12_InitInfo already has a UserData void pointer, only noticed this while writing the issue 😭
Of course ocornut thought about something like this, massive skill issue for me
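
A small sketch of how a user-data pointer gets around the capture problem. The types below are stand-ins written for this example, NOT the real ImGui or D3D12 declarations (handles are plain integers, and `DescriptorAllocator` is a made-up class); only the `UserData` and `SrvDescriptorAllocFn` field names come from the messages above:

```cpp
#include <cstdint>

// Stand-in types for illustration only; not the real ImGui/D3D12 headers.
struct InitInfo;
using AllocFn = void (*)(InitInfo* info, std::uint64_t* out_cpu, std::uint64_t* out_gpu);

struct InitInfo {
    void* UserData = nullptr;
    AllocFn SrvDescriptorAllocFn = nullptr;
};

// Hypothetical allocator that owns the state the callback needs.
class DescriptorAllocator {
public:
    void alloc(std::uint64_t* out_cpu, std::uint64_t* out_gpu) {
        *out_cpu = 0x1000 + next_ * 32;  // fake handle values
        *out_gpu = 0x9000 + next_ * 32;
        ++next_;
    }

private:
    std::uint64_t next_ = 0;
};

inline InitInfo make_init_info(DescriptorAllocator& allocator) {
    InitInfo info;
    info.UserData = &allocator;
    // No captures, so the lambda converts to a plain function pointer.
    // The object is recovered from UserData instead of a capture list.
    info.SrvDescriptorAllocFn = [](InitInfo* i, std::uint64_t* cpu, std::uint64_t* gpu) {
        static_cast<DescriptorAllocator*>(i->UserData)->alloc(cpu, gpu);
    };
    return info;
}
```

The allocator has to outlive the init info, since the callback only holds a raw pointer to it.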

fallen ocean
#

quickly put together some procedural generation stuff (using FastNoiseLite)

#

but right now it's running at 30 FPS even when no chunks are loading. Without proc gen it was 144 FPS 😦 will do some profiling tomorrow to see what went wrong

fallen ocean
#

The issue was I was doing meshing and generation for each voxel one by one, so chunking was not really working

Also have to put this project on hold cuz I might go to jail || currently awaiting approval from employer to work on this project as it's technically open source. Worst case I make the repo private and continue working on it ||

fallen ocean
fallen ocean
#

This is a chunk apparently
experimenting with using 1 DispatchMesh per chunk, not really possible due to the limitation that the number of vertices and primitives must each be <= 256
but still what is going on with this chunk

fallen ocean
#

getting somewhere
I'm basically using an amplification shader to do the meshing work (for a 4x4x2 chunk), and each individual triangle is processed by a simple mesh shader || (with thread block size as 1x1x1 💀 ) ||
Will fix it once this AS + MS simple case is done

fallen ocean
#

Tis done
1 amplification shader per chunk, and mesh shaders process triangles & vertices (32 'meshlets processed' in a thread group this time, where each meshlet = mesh for a single cube face or 2 triangles)
The amplification shader does all the meshing (nothing fancy, just don't render a face if 2 voxels are next to each other kind of thing), so CPU rn just needs to fill a uint32 where each bit tells if a voxel in the 4x4x2 chunk is active or not
no more R16 index buffers on CPU, so memory usage is heavily reduced

Plus, the mesh shader does backface culling so perf is a little bit better

Next step is to learn DXR as I'm sick of these ugly ahh voxels
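
A CPU-side sketch of the uint32 occupancy mask and the "skip the face if a neighbour is solid" rule from this message, written in C++ rather than shader code. The bit layout (x fastest, then y, then z) and the choice to treat voxels outside the chunk as empty are assumptions made for this example, and all names are hypothetical:

```cpp
#include <cstdint>

constexpr int kSizeX = 4, kSizeY = 4, kSizeZ = 2;  // 4x4x2 = 32 voxels, one bit each

// Assumed bit layout: x varies fastest, then y, then z.
constexpr int voxel_bit(int x, int y, int z) { return x + kSizeX * (y + kSizeY * z); }

constexpr std::uint32_t set_voxel(std::uint32_t mask, int x, int y, int z) {
    return mask | (1u << voxel_bit(x, y, z));
}

// Voxels outside the chunk are treated as empty (assumption for this sketch).
constexpr bool is_solid(std::uint32_t mask, int x, int y, int z) {
    return x >= 0 && x < kSizeX && y >= 0 && y < kSizeY && z >= 0 && z < kSizeZ &&
           ((mask >> voxel_bit(x, y, z)) & 1u) != 0;
}

// A face is emitted only if the voxel is solid and the neighbour it faces is not.
constexpr bool face_visible(std::uint32_t mask, int x, int y, int z, int dx, int dy, int dz) {
    return is_solid(mask, x, y, z) && !is_solid(mask, x + dx, y + dy, z + dz);
}

// Counts the faces that survive the neighbour test for the whole chunk.
inline int count_visible_faces(std::uint32_t mask) {
    const int dirs[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    int faces = 0;
    for (int z = 0; z < kSizeZ; ++z)
        for (int y = 0; y < kSizeY; ++y)
            for (int x = 0; x < kSizeX; ++x)
                for (const auto& d : dirs)
                    if (face_visible(mask, x, y, z, d[0], d[1], d[2])) ++faces;
    return faces;
}
```

A single voxel yields 6 faces and two adjacent voxels yield 10, since the shared pair of faces is dropped.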

fallen ocean
#

DXR is definitely harder than I thought 🙂 || and I'm not able to get time to work on this project, not an excuse ||
Right now I'm using ray pipelines (and not ray query), and got the bare minimum down: shadows, reflections, and transparency

I'll probably do a bit more RT in the meantime, and eventually (if I don't get skill issues) try custom intersection shaders that I can use for voxel rendering