#Help with voxel game code - running slowly + other issues

1 messages · Page 1 of 1 (latest)

errant kettle
#

Hi! I've been working on a voxel game recently because my friend bet me I couldn't make one. I've been having problems with speed, and have already implemented chunking, burst compiler, greedy meshing, LODs. I definitely messed up with burst compiler as I'm pretty new to it.

Here are my problems:

  1. Slow framerate - Around 45 fps with 1, 2, 4 voxel scales for LODs and 4 ring size for all. Framerate drops to 25 with another voxel scale of 8. (All of terrain in view)
  2. When I press play it takes forever to load
  3. Weird spherical artifacts below my main terrain?
  4. Need tips to better optimize my code

Please help me. (To use my code just place both TerrainChunkMesherBurst and Perlin Generation inside an empty gameobject)

Thank you!

uncut zealot
#

First step is to profile the code - have you done that?

errant kettle
#

I was doing that just now

#

anddd it crashed

#

There is an atrocious lag spike as I run the project. I'm guessing it is to do with my use of the burst compiler. Memory and disk is maxing out as I'm typing this.

#

When I use profiler it maxes out my memory and disk

#

20 gb worth of profilling data even after getting rid of the LODs and using only voxel scale 1.

#

It crashed my computere and I was forced to restartt

cunning zealot
mental ginkgo
#

I don't see any [BurstCompile] on your classes, which is necessary for the burst compiler to find the functions, so I don't think your problems have to do with burst

#

(or I guess in a way they do)

#

a random thing I noticed, your Idx shows me your data is laid out in the order of Y then Z then X
but your loop just below it is Z, Y, X, so not processing in the same order as memory is laid out
this will cost you because of cache misses

#

you use a function pointer to sample your data which makes it impossible to burst that code; instead you could use a visitor pattern like this

interface IDataSampler { void Sample(...); }

void MyMeshProcessor<T>(T dataSampler, ...) where T : unmanaged, IDataSampler
{
}

then you should be able to jobify and burst the sampling code as well

#

also note that you cannot burst anything that passes managed types around

#

like here, where chunkMap is actually a managed dictionary

#

I also recommend you don't use Vector3Int and structs like that, but instead switch to using Unity.Mathematics for all your math types, as burst can generate better code for them

errant kettle
#

Thanks for the advice!

errant kettle
#

I'll work on changing everything to int3s and jobifying the rest of the code

#

I still don't really understand where the lag spike is coming from, as I used profiler and it tells me all the lag is coming from the perlin generation start, which is expected.

#

Is there a way to load the chunks after I press play without lagging, doing it overtime

#

probably a coroutine right?

errant kettle
main vector
# errant kettle Can you further explain cache misses

(I'll answer this one while I can, I may not be much of a help otherwise) Whenever your CPU reads data from memory, instead of retrieving that singular bit of value (like float or string), it always reads a bigger chunk into very fast access memory called cache (which is generally much faster than even RAM access) expecting you to read next something close to that value so when you next need something, it will look for the cache first and then the slower memory if the value is not in the cache. Cache miss happens when the value is not on that chunk and needs to be retrieved from the memory (which would read new chunk into the cache).

Whenever you can, you should access data from memory (from list or array for example) in the same order it is stored there. Let's say you had a list of 6 values of which only 3 fit in the cache. If you for example wanted to add up all those 6 values, you would be much better off summing the elements in order 0, 1, 2, ... , 5 instead of 0, 5, 1, 4, 2, 3. In the first case you would read the value from 0 in memory, which would read 0, 1 and 2 into cache which means 1 and 2 are already in cache. When the index 3 comes in, the CPU would again read from memory and push 3, 4 and 5 into cache which would only leave you with 2 reads from memory. In the other case the CPU would again read 0, 1 and 2 into cache but when then right after you need 5, that is not in the cache and the CPU needs to read from memory, this time leaving 3, 4 and 5 in cache. Then you again ask for index 1 which is not in the cache and this cycle would continue making most reads happen from the memory instead of the cache.

#

This is gross oversimplification of caches though. In reality there are many levels of caches with different speeds and sizes, the fastest and smallest ones located directly at the CPU chip. The reality is not as simple as the example, but the idea is to always read data as locally (in indexes close to each other) as possible, not from here and there which will cause a lot of cache misses. By accessing the data locally, there is a high change that the values we need are already available nearby and don't need to be accessed from the far away memory like RAM let alone hard drive. I'm not sure how they are classified formally, but the RAM could also be thought of as cache of sorts, which makes data access from there much faster than from the drive.

The use of caches is one of the reasons our modern computers can calculate as fast as they can. Caches are not something we have any control of in our code, it is just something that is built to the whole architecture to make things fast. The only thing we can really do is sabotage the smart system by being careless with how we access the data we need.

errant kettle
#

Thank you so much for the explanation! I

#

I'm still relatively new to optimization so thank you for the long answer.

main vector
#

Caches are not often taught at first because it is something that is implemented by the hardware and the details will quickly get very technical. Programmes also don't really have any control over caches and they are something that pretty much work under the hoods optimizing memory access automatically. On most beginner level stuff the performance difference isn't usually noticeable either but in heavy calculations where a lot of data is processed, the difference between good and bad cache optimization can make a huge difference.

In your case from what I gathered, you have some sort of 3d structure where the data is stored in a list something like this 000, 100, 200, 010, 110, 210, 020, 120, 220, 001, 101, ... where the numbers mean the x, y and z coordinates (3x3x3 in this case). If you made three for loops like this:

for z from 0 to 2
    for y from 0 to 2
        for x from 0 to 2
            Access xyz here
```You would get the values in same order as they are in the list above. Instead if you did:
```py
for x from 0 to 2
    for y from 0 to 2
        for z from 0 to 2
            Access xyz here
```You would access values in order `000, 001, 002, 010, 011, ...` which would be all around the list probably causing a lot of cache misses which as discussed above can be bad for performance. Depends on many factors starting from the hardware ending on your implementation on how much that matters but in cases like this, fixing the issue is as easy as nesting the for loops in correct order (assuing that doesn't break something else).
errant kettle
#

Alright!

#

I've integrated most of the feedback given and I've fixed the startup problem by generating parts of my LODs step by step in a coroutine instead

#

I still need to fix the weird spherical artifacts though

main vector
errant kettle
errant kettle
#

I just fixed the issue

mental ginkgo
#

the burst inspector will show you if a function or job is burst compiled

mental ginkgo
#

your function pointer is also an example of this pattern

#

however, C# function pointers are managed objects and cannot be bursted

#

so I presented an alternative implementation

#

About caches, if I may give a very short summary, the key element is to always try to access memory that is close to each other, preferably in sequence.

I can recommend this talk from one of my favourite speakers, which is related to the subject and somewhat highlights this miss in our education.
https://www.youtube.com/watch?v=FJJTYQYB1JQ

http://CppCon.org
Discussion & Comments: https://www.reddit.com/r/cpp/
Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2019

Sorting Algorithms: Speed Is Found In The Minds of People

In all likelihood, sorting is one of the most researched classes of algorithms. It is a fund...

▶ Play video
mental ginkgo
mental ginkgo
# mental ginkgo About caches, if I may give a very short summary, the key element is to always t...

A small example, because I think this is interesting/useful.

struct MyStruct
{
  float4 A;
  float4 B;
  float4 C;
}

NativeArray<MyStruct> Layout1_ABC;

NativeArray<float4> Layout2_A;
NativeArray<float4> Layout2_B;
NativeArray<float4> Layout2_C;

// assume count to be high
for(int i = 0; i < count; ++i)
{
  MyStruct s = Layout1_ABC[i];

  float4 a = Layout2_A[i];
  float4 b = Layout2_B[i];
  float4 c = Layout2_C[i];
}

If we assume count is much higher than 3, when iterating over all A, B, and C in sequence as in the example, both these layouts will typically give similar performance, because they access roughly the same "amount" of memory, and the overhead from accessing separate memory addresses is negligible compared to the number of elements accessed.

Of course, there are optimizations and data usage patterns that can alter this, most notably due to burst's vectorization.

#

A common case where this matters is with GPU data, typically the position attributes of vertex data. Position data is often the only thing used by the shadowing pass, so it benefits from having the position data in a separate buffer from the rest of the vertex data, which is usually used together.

#

This concept is also what entity component systems are based on