#Rosy

broken fog
#

tri count is also not nearly as impactful as pixel count

#

unless you have massive overdraw

elfin cape
#

That was an example, but now that I am watching the video, it might be a 4k screen KEKW

cloud rivet
#

3840x2160 is the display I was recording

#

I'm going to draw the windows first and track those regions to avoid rendering the bg on

#

if I don't draw the bg or the triangle I get double the perf

#

if I only draw fps

#

it's the bg and something totally unrelated to drawing

#

I just need to start profiling

#

a mostly blank screen getting 77 fps is pretty ridiculous

#

this is my first C program

bronze socket
#

do you have a 144p monitor

#

144hz

#

not 144p lmao

#

my guess is that you have vsync on and it's locking to 77

cloud rivet
#

I did change my driver from studio to the game driver

#

maybe I did something there

#

yeah if I run rosy I can get way more

#

this is just my application being slow

#

I will start on a profiler using the window event tracing and performance tools

#

it'll be fun

brisk chasm
#

just saying, they are going to remove wmic with the next win11 release 25h2

cloud rivet
#

yeah planning on using win32 event tracing and diagnostic apis

#

those aren't deprecated

#

I will do what tracy does and use macros to trace in my application that are noop unless I define a #define profiling macro
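
A minimal sketch of the compile-time switch described here, assuming a hypothetical `TR_PROFILING` define and hypothetical `TR_BEGIN`/`TR_END` macro names (none of these are from the actual project or from Tracy). It uses the portable `clock()` instead of QueryPerformanceCounter so it builds anywhere:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical switch: a real build would pass -DTR_PROFILING=1 on the
   compiler command line instead of hard-coding it here. */
#define TR_PROFILING 1

enum { TR_MAX_ZONES = 64 };

typedef struct {
  clock_t start;    /* clock() value when the zone was last opened */
  double total_sec; /* accumulated time spent inside the zone */
  uint32_t hits;    /* how many times the zone was closed */
} TrZone;

static TrZone tr_zones[TR_MAX_ZONES];

static void tr_begin(uint32_t id) {
  assert(id < TR_MAX_ZONES);
  tr_zones[id].start = clock();
}

static void tr_end(uint32_t id) {
  assert(id < TR_MAX_ZONES);
  tr_zones[id].total_sec += (double)(clock() - tr_zones[id].start) / CLOCKS_PER_SEC;
  tr_zones[id].hits++;
}

#if TR_PROFILING
#define TR_BEGIN(id) tr_begin(id)
#define TR_END(id) tr_end(id)
#else
/* Expand to a no-op expression so call sites still need a semicolon. */
#define TR_BEGIN(id) ((void)0)
#define TR_END(id) ((void)0)
#endif

/* Example call site: the loop is timed only when TR_PROFILING is non-zero. */
static int sum_to(int n) {
  int total = 0;
  TR_BEGIN(0);
  for (int i = 1; i <= n; i++)
    total += i;
  TR_END(0);
  return total;
}
```

With `TR_PROFILING` set to 0 the call sites compile to nothing, so the instrumentation can stay in the code permanently.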

#

and how I think I'll make this work is that when profiling is enabled I'll add a keyboard shortcut and it will profile for a short bit and dump a text file when it's done

#

I'll start on that today

#

my goal is to just start with a frame marker, then add support for zones, and then markers, figure out what's slow and maybe add memory profiling and just figure it out

#

I'll just work on it a bit, since things are slow and I need more profiling capabilities

#

to resolve slowness

cloud rivet
#

I might start with some UI tbh

#

we'll see

cloud rivet
#

I need to add a loop sampling thingy I guess

#

I'm also missing something

#

there's a lot of time spent somewhere idk what it is

#

ah

#

it's the render graph

#

that feels really slow

astral hinge
#

are those numbers milliseconds

#

or seconds frogstare

cloud rivet
#
void start_tr(AppContext *actx, Arena *arena, u32 trace_id) {
  if (!actx)
    fatal("no actx");
  if (!arena)
    fatal("no arena");
  if (!actx->trctx)
    fatal("no trctx");
  if (actx->trctx->num_traces <= trace_id)
    fatal("traces overflow");

  QueryPerformanceCounter(&actx->trctx->traces[trace_id].starting_time);
}

void end_tr(AppContext *actx, Arena *arena, u32 trace_id) {
  if (!actx)
    fatal("no actx");
  if (!arena)
    fatal("no arena");
  if (!actx->trctx)
    fatal("no trctx");
  if (actx->trctx->num_traces <= trace_id)
    fatal("traces overflow");
  QueryPerformanceCounter(&actx->trctx->traces[trace_id].ending_time);

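  // Elapsed ticks. Scale to microseconds before dividing by ticks-per-second
  // to limit precision loss (same order as the MSDN sample).
  u64 duration
      = actx->trctx->traces[trace_id].ending_time.QuadPart - actx->trctx->traces[trace_id].starting_time.QuadPart;
  duration *= 1'000'000;
  duration /= actx->trctx->frame_frequency.QuadPart;
  // Stored as f64 seconds: whole microseconds divided by 1e6.
  actx->trctx->traces[trace_id].duration = (f64)duration;
  actx->trctx->traces[trace_id].duration /= 1'000'000;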
}
#

whatever this is

#
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;

QueryPerformanceFrequency(&Frequency); 
QueryPerformanceCounter(&StartingTime);

// Activity to be timed

QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;


//
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second. We use these values
// to convert to the number of elapsed microseconds.
// To guard against loss-of-precision, we convert
// to microseconds *before* dividing by ticks-per-second.
//

ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
astral hinge
#

aren't you supposed to multiply the result of QPC by the frequency of the cpu or something

cloud rivet
#

duration /= actx->trctx->frame_frequency.QuadPart;

astral hinge
#

ah

cloud rivet
#

so I think it's ms

astral hinge
#

18 us for cpu raster seems quite fast

cloud rivet
#

sorry

astral hinge
#

idk what it's rasterizing though

cloud rivet
#

oh sorry

#

no these are seconds

astral hinge
#

ah

cloud rivet
#

1 here = 1 second

#

it's really slow

astral hinge
#

many opportunities for improvement then

cloud rivet
#

ya

astral hinge
cloud rivet
#

yeah I thought maybe you meant is that 33 seconds

#

I'm like no

astral hinge
#

lol

cloud rivet
#

the render graph is really slow

#

the cpu raster being slow is whatever, I knew that would be slow

astral hinge
#

now that you have a profiling thingy, time to spam it

cloud rivet
#

ya

astral hinge
#

I think you should make a sampling profiler too just to get a quick overview of stuff without having to instrument

cloud rivet
#

yes

#

what are you thinking specifically?

astral hinge
#

Well there is the win32 function(s) for getting the stack pointer and then the call stack from it, which you can call from another thread (so it costs basically nothing on the main thread)

#

Idk what the function is though so you'd have to do a little research

cloud rivet
#

I'll check that out

#

what does that do for me?

astral hinge
#

Anyway, you can put the call stack info into a map and then do a little math to figure out how long you spent in each place

cloud rivet
#

oh I see

astral hinge
#

You take a sample of the call stack every millisecond or so
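
The "map plus a little math" idea can be sketched roughly like this. It only covers the bookkeeping side: how the sampled addresses are captured is left out, and `SampleMap`, `sample_add` and the other names are made up for illustration. Each sample is assumed to be already resolved to a function start address, and the time estimate is simply sample count times sampling interval:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SAMPLE_SLOTS = 256 }; /* power of two so probing can mask instead of mod */

typedef struct {
  uint64_t addr;  /* sampled address (0 means the slot is empty) */
  uint32_t count; /* number of samples that landed on this address */
} SampleSlot;

typedef struct {
  SampleSlot slots[SAMPLE_SLOTS];
  uint32_t total; /* total samples recorded */
} SampleMap;

static size_t sample_hash(uint64_t addr) {
  /* Multiplicative hash; the top 8 bits select one of 256 slots. */
  return (size_t)((addr * 0x9E3779B97F4A7C15ull) >> 56);
}

/* Record one sample. Linear probing; returns 0 when the table is full. */
static int sample_add(SampleMap *m, uint64_t addr) {
  assert(m && addr != 0);
  size_t start = sample_hash(addr);
  for (size_t probe = 0; probe < SAMPLE_SLOTS; probe++) {
    SampleSlot *s = &m->slots[(start + probe) & (SAMPLE_SLOTS - 1)];
    if (s->addr == 0 || s->addr == addr) {
      s->addr = addr;
      s->count++;
      m->total++;
      return 1;
    }
  }
  return 0;
}

/* Number of samples seen for an address (0 if it was never sampled). */
static uint32_t sample_count(const SampleMap *m, uint64_t addr) {
  assert(m && addr != 0);
  size_t start = sample_hash(addr);
  for (size_t probe = 0; probe < SAMPLE_SLOTS; probe++) {
    const SampleSlot *s = &m->slots[(start + probe) & (SAMPLE_SLOTS - 1)];
    if (s->addr == addr)
      return s->count;
    if (s->addr == 0)
      return 0;
  }
  return 0;
}

/* Estimated time in a function = samples that hit it * sampling interval. */
static double sample_estimated_ms(const SampleMap *m, uint64_t addr, double interval_ms) {
  return (double)sample_count(m, addr) * interval_ms;
}
```

At a 1 ms interval, 30 hits on one address reads as roughly 30 ms spent there. The estimate is statistical, so short runs will be noisy.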

cloud rivet
#

ohhhh

astral hinge
#

That's how the profiler works in visual studio

cloud rivet
#

that's cool!

#

I'll do that

astral hinge
#

It'll be so cool to have your own suite of tooling

#

If something lacks you can just improve it

cloud rivet
#

yeah, I want to add vk timing queries too

#

also memory use, since I have a custom allocator I can track all my memory

astral hinge
#

hmm in debug mode you could make the allocate function a macro that records the source info too
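
One way the macro idea could look, as a sketch only: `DemoArena`, `demo_arena_alloc_at` and `DEMO_ARENA_ALLOC` are hypothetical names, not the project's actual allocator, and the arena here only remembers the most recent call site to keep the example short:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint8_t *base;         /* backing memory, assumed 16-byte aligned */
  size_t cap;            /* size of the backing memory in bytes */
  size_t used;           /* bytes handed out so far */
  const char *last_file; /* debug bookkeeping: most recent call site */
  int last_line;
  size_t alloc_count;
} DemoArena;

/* Bump allocator that also records where the request came from. */
static void *demo_arena_alloc_at(DemoArena *a, size_t size, const char *file, int line) {
  assert(a && a->base);
  size_t aligned = (a->used + 15u) & ~(size_t)15u; /* round up to 16 bytes */
  if (aligned > a->cap || size > a->cap - aligned)
    return NULL; /* out of space */
  a->used = aligned + size;
  a->last_file = file;
  a->last_line = line;
  a->alloc_count++;
  return a->base + aligned;
}

#ifndef NDEBUG
/* Debug builds capture the caller's file and line automatically. */
#define DEMO_ARENA_ALLOC(a, size) demo_arena_alloc_at((a), (size), __FILE__, __LINE__)
#else
/* Release builds skip the source info. */
#define DEMO_ARENA_ALLOC(a, size) demo_arena_alloc_at((a), (size), NULL, 0)
#endif
```

A fuller version would presumably append each record to a list so the memory view can group usage by call site.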

cloud rivet
#

yes that sounds very valuable

#

I should make those ms
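
For reference, a small sketch of the tick-to-millisecond conversion, written against plain integers so it does not depend on the Win32 headers. `ticks_to_ms` is a hypothetical helper. The start/end values stand in for what QueryPerformanceCounter returns and `ticks_per_sec` for QueryPerformanceFrequency:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a pair of counter readings to elapsed milliseconds. */
static double ticks_to_ms(int64_t start_ticks, int64_t end_ticks, int64_t ticks_per_sec) {
  assert(ticks_per_sec > 0 && end_ticks >= start_ticks);
  int64_t elapsed = end_ticks - start_ticks;
  /* Scale to microseconds before the integer divide, as the MSDN sample does.
     The multiply can overflow for very long intervals (days at 10 MHz). */
  int64_t us = (elapsed * 1000000) / ticks_per_sec;
  return (double)us / 1000.0;
}
```

With a 10 MHz counter, 10,000 elapsed ticks come out as 1.0 ms.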

brisk chasm
#

what is xxx_tr? xxx_trace? or xxx_tablerow? or xxx_transient?

#

did anyone tell the c people that we have enough disk space today for storing all the characters for our source codes 🙂

cloud rivet
#

I have a character budget because I am a responsible adult

#

waste not want not

#

Also clearly it would be xxxxx_trow and xxxxx_trs

#

smh

#

Nah

cloud rivet
#

There will only ever be one thing that gets to be called tr

#

The profiling code

#

I don’t want long varying names for these

#

I can spot a subsystem function easily. It’s easier to see the pattern working with my code

#

My internal functions are long and descriptive

#

These subsystem things all have a fat structure that is on the app context

#

actx->trctx->longthing.morelongerthings = bignamestuff;

#

you know what tr in trctx is because there’s only one tr

cloud rivet
#

using StackWalker is going to be some really "works on my machine" level code

#

lol

#

this is a how to, not a library

#

well it is a C++ library also, but I'm not using it

cloud rivet
#

To walk the callstack of another thread inside the same process, you need to suspend the target thread (so the callstack will not change during the stack-walking).

#

heh

#

no thanks

#

I think I don't need to walk it

#

I can just get the current stack

#

the current place

astral hinge
#

I believe you can also record the pc and then get just the current function (no call stack) from that

cloud rivet
#

pc?

astral hinge
#

program counter

brisk chasm
#

i believe its all in dbghlp.h

#

emphasis on hlp 😛

astral hinge
#

it's probably SymFromAddr

cloud rivet
#

any game worth its salt should be adding -lDbgHelp as a dep tbh misinfo

#

you know, if this game was for reals from scratch I wouldn't need that, I'd just write my own operating system on my own from scratch mined materials via a machine I built by hand using tools I also built

echo crystal
#

🪤

broken fog
#

do it it will be fun froge_evil

cloud rivet
#

nope

#

hard pass

#

sounds boring

#

I'm interested in graphics

astral hinge
#

you probably need like 30 PhDs of knowledge to begin making your own computers from scratch

cloud rivet
#

uhh I beg to differ

cloud rivet
#

well I got a stackframe, now what, I guess I do that SymGetSymFromAddr64 thing

echo crystal
#

pretty cool

cloud rivet
#

I am pretty sure I will have to download the latest debughell.dll

#

for this to work

echo crystal
#

I also saw a video of someone doing a computer made with water

broken fog
#

mechanical computers are cool

wraith urchin
# echo crystal I also saw a video of someone doing a computer made with water

Computers add numbers together using logic gates built out of transistors. But they don't have to be! They can be built out of greedy cup siphons instead! I used specially designed siphons to work as XOR and AND gates and chained them...

echo crystal
#

yes

#

i love standup maths (featured in this video)

cloud rivet
#

I should get SymGetLineFromAddr64 also

cloud rivet
#

tracy calls SuspendThread

#

When a crash occurs, execution in the crashing thread is redirected to the handler that was set earlier. The handler lists all threads running in the program and one by one pauses their execution, leaving only two threads (there is actually a race, which can result in another thread starting executing, as suspending all threads is not an atomic operation) in a running state: the crashed thread, which is executing the crash handler, and the profiler worker thread. This is done either by calling the SuspendThread() procedure on Windows, or sending the unused SIGPWR signal -- during profiler setup another handler was installed for this signal, one that enters an infinite sleep loop.

#

only for crashes

#

nm

#

anyway

#

it works

#
void capture_stack_tr(Arena *arena) {
  if (!arena)
    fatal("no arena");

  CONTEXT context;
  RtlCaptureContext(&context);
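  // Rsp is a 64-bit register, so print it as 64-bit hex rather than with %d.
  DEBUG_PRINT("capturing stack %llx\n", (unsigned long long)context.Rsp);

  // Zero-initialize so fields we don't set below aren't left as stack garbage.
  STACKFRAME64 StackFrame = {0};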
  StackFrame.AddrPC.Offset = context.Rip;
  StackFrame.AddrPC.Mode = AddrModeFlat;
  StackFrame.AddrFrame.Offset = context.Rsp;
  StackFrame.AddrFrame.Mode = AddrModeFlat;
  StackFrame.AddrStack.Offset = context.Rsp;
  StackFrame.AddrStack.Mode = AddrModeFlat;

  if (!StackWalk64(IMAGE_FILE_MACHINE_AMD64,
                   GetCurrentProcess(),
                   GetCurrentThread(),
                   &StackFrame,
                   &context,
                   NULL,
                   SymFunctionTableAccess64,
                   SymGetModuleBase64,
                   NULL)) {
    fatal("stackwalk failed");
  }
  DWORD64 dwaddress = StackFrame.AddrPC.Offset;
  DWORD64 dwDisplacement = 0;
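  // One header plus room for the name (Name[] trails the struct).
  const size_t symSize = sizeof(IMAGEHLP_SYMBOL64) + 1024;
  IMAGEHLP_SYMBOL64 *pSym = arena_alloc(arena, symSize);
  if (!pSym)
    fatal("OOM");
  memset(pSym, 0, symSize);
  // DbgHelp documents SizeOfStruct as the field the caller must fill in.
  pSym->SizeOfStruct = sizeof(IMAGEHLP_SYMBOL64);
  pSym->MaxNameLength = 1024;
  if (!SymGetSymFromAddr64(GetCurrentProcess(), dwaddress, &dwDisplacement, pSym)) {
    // GetLastError() returns a DWORD, which is an unsigned long.
    DEBUG_PRINT("nope %lu\n", GetLastError());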
  } else {
    DEBUG_PRINT("yup %s\n", pSym->Name);
  }
}
#
yup capture_stack_tr
#
void capture_stack_tr(Arena *arena) {
#

now I guess I just yolo capture the stack from another thread lol

#

I think that might end up with use after free? 😨

#

I think this is UB tbh

#

hrm all I need is the address

#

I'll just try it

#

tracy doesn't use RtlCaptureContext

#

wtf is it doing

#

so just have to follow where it gets the address from

#

cheating!

#

tracy uses libunwind

broken fog
cloud rivet
#

it's the rad debugger

broken fog
#

that's rad

#

(sorry)

cloud rivet
#

it's just windows atm I think

#

it's called rad because it's made by the rad games tool team inside Epic

#

idk how you get a job where you can just make some open source win32 debugger at a company like Epic

#

All DbgHelp functions, such as this one, are single threaded. Therefore, calls from more than one thread to this function will likely result in unexpected behavior or memory corruption. To avoid this, you must synchronize all concurrent calls from more than one thread to this function.

#

well that's fine

#

A handle to the thread for which the stack trace is generated. If the caller supplies a valid callback pointer for the ReadMemoryRoutine parameter, then this value does not have to be a valid thread handle. It can be a token that is unique and consistently the same for all calls to the StackWalk64 function.

#

well

#

You cannot get a valid context for a running thread. Use the SuspendThread function to suspend the thread before calling GetThreadContext.

#

I don't think this is possible

#

with windows apis

cloud rivet
#

fuck all this

cloud rivet
#

it was cool getting the current processes stack though

#

I'm going to write a macro to just use my existing profiling and wrap function calls I think are slow

#

I can track nested profiling since it maintains a state in the trace context
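
A rough sketch of what "maintains a state in the trace context" could mean for nesting: a small stack of open zone ids, where the depth at open time gives the indentation level for the UI and a pop always closes the innermost zone. `TrStack`, `tr_push` and `tr_pop` are hypothetical names, not the project's code:

```c
#include <assert.h>
#include <stdint.h>

enum { TR_STACK_MAX = 32 };

typedef struct {
  uint32_t stack[TR_STACK_MAX]; /* ids of the zones that are currently open */
  uint32_t depth;               /* number of open zones */
} TrStack;

/* Open a zone; returns its nesting level (0 = top level). */
static uint32_t tr_push(TrStack *s, uint32_t zone_id) {
  assert(s && s->depth < TR_STACK_MAX);
  s->stack[s->depth] = zone_id;
  return s->depth++;
}

/* Close the innermost open zone and return its id. */
static uint32_t tr_pop(TrStack *s) {
  assert(s && s->depth > 0);
  return s->stack[--s->depth];
}
```

Because the pop does not take an id, a mismatched begin/end pair shows up as a wrong id at close time instead of silently corrupting another zone's timing.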

#

I learned a lot of stuff

#

event tracing is probably the thing to use for this at some point

cloud rivet
#

I'm going to add profiling info to my render graph so it records how long each part of the graph took

cloud rivet
#

even with event tracing I'd have to instrument my code

cloud rivet
#

it all makes sense

#

cool

#

ok, I'm just getting data still

#

I'll think about solutions later

#

really happy with my render graph tbh, it was easy to hook perf onto it

#

this is my render graph code now:

void render_gfx(AppContext *actx, Arena *arena) {
  if (!arena)
    fatal("render_gfx: null arena");
  if (!actx)
    fatal("render_gfx: null actx");
  if (!actx->gctx)
    fatal("render_gfx: null gctx");

  if (actx->gctx->minimized)
    return;

  start_render_graph_tr(actx, arena);
  render_node_t *current_node = actx->gctx->render_graph;
  while (true) {
    render_node_t *next_node = pump_graph(actx, arena, current_node);
    add_render_graph_tr(actx, arena, (u32)current_node->render_node_type, current_node->name);
    if (!next_node)
      break;
    current_node = next_node;
  }
  end_render_graph_tr(actx, arena);
  return;
}
cloud rivet
#

@elfin cape is correct, filling that full bitmap takes no time at all, it's all other stuff

#

ok next thing I have to build is getting the average time spent in a function and the full time spent in a function per frame

#

this is all cpu bound

#

pretty sure?

#

I don't know actually

#

this is great, I was going to do a thing where I don't draw the background in areas where I drew a window, and now I know that wouldn't help much at all, I'd be saving in the least expensive part of the rasterization

#

that app perf window is hard to read I should nest the zones

cloud rivet
#

I actually need a way to do that for any zone, find the average time in the zone, the total time spent in the zone per frame

cloud rivet
#

now need to do the average & total time per frame thing for each zone
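
A small sketch of per-zone accumulation, assuming each zone reports its duration in ms every time it closes and that something calls an end-of-frame hook once per frame. `ZoneStats`, `zone_record` and `zone_end_frame` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  double frame_total_ms; /* time accumulated in the frame being recorded */
  uint32_t frame_hits;   /* times the zone closed in that frame */
  double last_total_ms;  /* total for the most recently finished frame */
  double last_avg_ms;    /* average per hit for that frame */
} ZoneStats;

/* Called every time the zone closes. */
static void zone_record(ZoneStats *z, double ms) {
  assert(z && ms >= 0.0);
  z->frame_total_ms += ms;
  z->frame_hits++;
}

/* Called once per frame: publish the totals, then reset the accumulators. */
static void zone_end_frame(ZoneStats *z) {
  assert(z);
  z->last_total_ms = z->frame_total_ms;
  z->last_avg_ms = z->frame_hits ? z->frame_total_ms / (double)z->frame_hits : 0.0;
  z->frame_total_ms = 0.0;
  z->frame_hits = 0;
}
```

The UI would read `last_total_ms` and `last_avg_ms`, so the numbers stay stable while the next frame is still accumulating.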

#

I was thinking of just using the mapped host visible vk buffer for the bitmap instead of a cpu bitmap to avoid a copy

#

I still have to get more data, but I should cache the characters also

elfin cape
#

Are you using a scanline algorithm for rasterizing a triangle?

cloud rivet
#

it's just a really dumb bounding box for the triangle, then do a barycentric coordinate check for each pixel, I'm going to get to that
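
For readers following along, the bounding box plus barycentric check can be sketched like this. It is a generic version of the technique and not the project's code: `Vec2`, `edge_fn` and `raster_bbox` are made-up names, pixels are sampled at their centers, and the color is flat instead of interpolated:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float x, y; } Vec2;

/* Twice the signed area of triangle (a, b, p). */
static float edge_fn(Vec2 a, Vec2 b, Vec2 p) {
  return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

static float min3f(float a, float b, float c) {
  float m = a < b ? a : b;
  return m < c ? m : c;
}

static float max3f(float a, float b, float c) {
  float m = a > b ? a : b;
  return m > c ? m : c;
}

/* Clamp a float into [lo, hi] and truncate to int. */
static int clamp_to_int(float v, int lo, int hi) {
  if (v < (float)lo) return lo;
  if (v > (float)hi) return hi;
  return (int)v;
}

/* Fill every pixel whose center is inside the triangle; returns the count. */
static uint32_t raster_bbox(uint32_t *pixels, int width, int height,
                            Vec2 v0, Vec2 v1, Vec2 v2, uint32_t color) {
  assert(pixels && width > 0 && height > 0);
  float area = edge_fn(v0, v1, v2);
  if (area == 0.0f)
    return 0; /* degenerate triangle covers nothing */

  /* Bounding box of the triangle, clamped to the bitmap. */
  int x_min = clamp_to_int(min3f(v0.x, v1.x, v2.x), 0, width - 1);
  int x_max = clamp_to_int(max3f(v0.x, v1.x, v2.x), 0, width - 1);
  int y_min = clamp_to_int(min3f(v0.y, v1.y, v2.y), 0, height - 1);
  int y_max = clamp_to_int(max3f(v0.y, v1.y, v2.y), 0, height - 1);

  uint32_t covered = 0;
  for (int y = y_min; y <= y_max; y++) {
    for (int x = x_min; x <= x_max; x++) {
      Vec2 p = {(float)x + 0.5f, (float)y + 0.5f};
      /* Dividing by the signed area makes the test independent of winding. */
      float w0 = edge_fn(v1, v2, p) / area;
      float w1 = edge_fn(v2, v0, p) / area;
      float w2 = edge_fn(v0, v1, p) / area;
      if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f) {
        pixels[y * width + x] = color;
        covered++;
      }
    }
  }
  return covered;
}
```

Roughly half of the pixels in the box fail the test for a typical triangle, which is the wasted work the later messages about skipping empty areas are aiming at.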

#

I knew that would be slow

elfin cape
#

just checking froge

cloud rivet
#

oh the render gfx should be nested under run app

#

once I have the data for per frame averages/total time of the slow stuff that runs in a loop I can start finally working on actually improving the perf

cloud rivet
#

I might dump clang language extension vectors and matrices, they are a bit wonky I think

#

as in

#

they create weird padding and alignment issues on structs?

#

it's very strange

broken fog
#

but writing vecmath in c without them is agonyfrog

cloud rivet
#

what are you using?

#

I don't really care about operator overloading

#

sure it's more readable to use operators, it's not the end of the world

broken fog
cloud rivet
#

ah

#

could just be a me issue

#

I guess I could look at compiler explorer the next time I run into a problem and try to understand it better

cloud rivet
#

hrm

#

I'm not doing that right

cloud rivet
#

ok

#

it's just for loops on the rasterization I think

#

the total time inside the loop is far less time than just the work of iterating the loop

#

unless I have a bug in this logic, which I don't think I do

cloud rivet
#
          start_tr(actx, arena, actx->crctx->traces[18]); // x loop start
          {
            for (i32 x = x_min; x < x_max; x++) {
              start_tr(actx, arena, actx->crctx->traces[21]); // bb col start
              {
                float4 pos = pixel_to_sc(actx->crctx->bitmap_width, actx->crctx->bitmap_height, (f32)x, (f32)y);
                start_tr(actx, arena, actx->crctx->traces[23]);
                float4 bc = barycentric_coords(pos, t1);
                end_tr(actx, arena, actx->crctx->traces[23]);
                if (bc.x >= 0.f && bc.y >= 0.f && bc.z >= 0.f) {
                  rgba_to_rgba16(bc_mix(t1.colors, bc), pixel);
                }
                pixel++;
              }
              end_tr(actx, arena, actx->crctx->traces[21]); // bb col end
            }
          }
          end_tr(actx, arena, actx->crctx->traces[18]); // x loop end

#

it's just that for loop itself

#

for (i32 x = x_min; x < x_max; x++) {

#

idk

#

nesting these traces is really error prone

#

I need a better way

#

anyway

#

I think I got it

#

enough to go on

#

the more traces I showed in the UI the slower it got lol kekkedsadge

#

now numbers just have to go up

#

or down

cloud rivet
#

got it a little bit faster

#
            for (i32 x = x_min; x < x_max; x += 6) {
+              f32 x_f = (f32)x;
               start_tr(actx, arena, actx->crctx->traces[20]); // bb col start
-              {
-                float4 pos = pixel_to_sc(actx->crctx->bitmap_width, actx->crctx->bitmap_height, (f32)x, (f32)y);
-                start_tr(actx, arena, actx->crctx->traces[23]);
-                float4 bc = barycentric_coords(pos, t1);
-                end_tr(actx, arena, actx->crctx->traces[23]);
-                if (bc.x >= 0.f && bc.y >= 0.f && bc.z >= 0.f) {
-                  rgba_to_rgba16(bc_mix(t1.colors, bc), pixel);
-                }
-                pixel++;
-              }
+              cr_test_triangle_pix(actx, arena, pixel, x_max_f, x_f, y_f);
+              cr_test_triangle_pix(actx, arena, pixel + 1, x_max_f, x_f + 1.f, y_f);
+              cr_test_triangle_pix(actx, arena, pixel + 2, x_max_f, x_f + 2.f, y_f);
+              cr_test_triangle_pix(actx, arena, pixel + 3, x_max_f, x_f + 3.f, y_f);
+              cr_test_triangle_pix(actx, arena, pixel + 4, x_max_f, x_f + 4.f, y_f);
+              cr_test_triangle_pix(actx, arena, pixel + 5, x_max_f, x_f + 5.f, y_f);
+              pixel += 6;
               end_tr(actx, arena, actx->crctx->traces[20]); // bb col end
             }
#

adding more doesn't really make it much faster

#

I'll add something to avoid areas that I know won't have any part of the triangle in it

#

that should reduce things

#

I can do these concurrently too

#

and use simd I guess

#

I'm going to look at my text rendering next that's also slow af

#

for the same reason

#

hrm I just need to do less here with the characters, by caching them I think

#

I shaved like 5 ms by doing the same thing to chars, only that's more complex and breaks character rendering

#

I'm going to add character caching

#

it's a whole lot of just do less

broken fog
#

that's an easy win if it's not

#

well, "easy", you know what i mean KEKW

cloud rivet
#

I don't want to do that yet

#

I think there's a way to get this to be faster just single threaded

#

one thing is to just reduce the scale, I probably shouldn't be going 4k with a cpu software rasterizer

broken fog
#

ye KEKW

broken fog
#

mt will add complexity so getting it to be as fast as possible single threaded is a good idea

#

but keep in mind some optimizations may be very well suited to a multithreaded renderer

cloud rivet
#

yes, I plan on getting there

broken fog
#

any changes will be far more obvious

#

your idea about avoiding empty areas of the tri is pretty solid, you could look into dividing the tri into tiles or something like that

#

just thinking out loud but maybe a lil quadtree (2 or 3 levels depending on the size of the tri, no more) could speed things up significantly

cloud rivet
#

yes

#

I'm going to get rid of the ability to scale a window, it looks dumb and adds a lot of complexity

cloud rivet
#

I can't really add the tracing code I have into the hot loops, the profiling itself distorts the profiling

#

well

#

I did learn something from it

cloud rivet
#

doing much better already

#

haven't even done the character caching yet

#

was at 28 fps before with this amount of text

#

with less text actually, and less profiling, so it's faster now with more text and more profiling

#

yeah I am going to cache the characters next

#

the next thing I'll do is try to use the host visible vk buffer for the bit map instead of a separate set of bytes

#

I'm hoping I can get 3ms per frame out of those changes

#

after that I'll add a slider ui widget

#

and then I'll use that to dynamically adjust the scale to find a good perf for this window size

#

if I shrink the window to this

#

I get 60 fps

#

hrm

#

I think the windows capture drops the frames a little

#

after font caching, a good scale, I'll add multithreading

#

font caching + vk buffer, scale and multithreading

#

and the multithreading will I think be generating frames in the background, thinking I'll have a fif type deal

#

once I load more triangles I'll probably have to resort to simd and 32 byte aligned operations

cloud rivet
#

to

#

rendering even more text

#

my windows and ui basically don't cost anything anymore

#

tomorrow I'll try and use the vk host visible buffer as the bitmap

#

and see if that makes things faster

#

then I'll add multithreading and go back to working on adding some ui features and start actually doing some 3D maybe, finally

elfin cape
#

what was the thing that was slowing down the text so much? 9ms -> 0.6ms speed up is really nice

cloud rivet
#

I had some unnecessary bounds checking, I removed the scaling code, and then I added font caching

#

I reduced the size of the function that rendered text quite a bit

elfin cape
cloud rivet
#

as long as it's faster than copying the full bitmap to the vk buffer it should be a win I hope

#

it's costing me a full ms

elfin cape
cloud rivet
#

that's not even the submit,

#

I think the submit must be part of the wait? I don't know

elfin cape
#

ngl I would do normal hw raster KEKW

cloud rivet
#

oh you mean just use the win32 apis?

elfin cape
#

render triangles using vulkan KEKW

cloud rivet
#

ohh

#

yeah I am going to do that too

elfin cape
#

oh okay

cloud rivet
#

I have a gpu software raster pipeline set up

#

and a graphics pipeline

#

and a rt pipeline

#

I'm going to just work on them all

#

when I get bored with one I work on the other

cloud rivet
#

oh another thing I did to make ui render faster is just draw the windows bg all at once

#

and I write the first row to memory and then just copy it to the rest of the rows
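
The first-row-then-copy trick might look something like this. It is a sketch assuming a 32-bit pixel format and a row stride measured in pixels, and `fill_rect_rows` is a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill a w*h rectangle at (x, y) in a bitmap that is `stride` pixels wide. */
static void fill_rect_rows(uint32_t *pixels, int stride, int x, int y,
                           int w, int h, uint32_t color) {
  assert(pixels && stride > 0 && x >= 0 && y >= 0 && w > 0 && h > 0);
  assert(x + w <= stride);
  uint32_t *first = pixels + (size_t)y * (size_t)stride + (size_t)x;

  /* Write the first row one pixel at a time. */
  for (int i = 0; i < w; i++)
    first[i] = color;

  /* Every other row is a straight copy of the first one. */
  for (int row = 1; row < h; row++)
    memcpy(first + (size_t)row * (size_t)stride, first, (size_t)w * sizeof *first);
}
```

The caller is responsible for making sure `y + h` stays inside the bitmap, since the function does not know the bitmap height.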

#

that made things go faster too, I had all this complex, and buggy, logic with padding and nonsense and I just removed it all and just draw a square, and then render font on top of it instead of what I was doing before, where I drew some part of the window, then some font, then the rest of the line, etc

cloud rivet
#

I think I need the concept of a pixel

#

right now I am doing a serial process of here's my current pixel, let me convert it to coordinates, now let me do barycentric coordinates, does it pass or fail? mix a color, write it to bitmap, move on to next pixel

#

what I think I should do is actually uh

#

break all that up into layers

#

get a bunch of pixels all together at once

#

get their coordinates

#

get their barycentric coordinates

#

do the check

#

mix a color

#

write to bitmap

#

then I can do simd I think
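
The staged layout listed above can be sketched in plain C before any intrinsics are involved. This is only an illustration under assumptions: a batch of 8 pixels along one row, a triangle with positive orientation under this edge function, a flat color instead of the mix step, and hypothetical names (`P2`, `Tri2`, `shade_batch`). Each stage is its own simple loop over small arrays, which is the shape compilers and hand-written SIMD both prefer:

```c
#include <assert.h>
#include <stdint.h>

enum { BATCH = 8 };

typedef struct { float x, y; } P2;
typedef struct { P2 a, b, c; } Tri2;

/* Twice the signed area of (a, b, (px, py)). */
static float edge2(P2 a, P2 b, float px, float py) {
  return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

/* Process BATCH pixels of row `y` starting at x_start; returns pixels written.
   `row` must have at least x_start + BATCH entries. */
static uint32_t shade_batch(uint32_t *row, int x_start, int y, Tri2 t, uint32_t color) {
  assert(row && x_start >= 0);
  float px[BATCH], w0[BATCH], w1[BATCH], w2[BATCH];
  int inside[BATCH];
  float py = (float)y + 0.5f;

  /* Stage 1: pixel center coordinates for the whole batch. */
  for (int i = 0; i < BATCH; i++)
    px[i] = (float)(x_start + i) + 0.5f;

  /* Stage 2: the three edge values (unnormalized barycentrics). */
  for (int i = 0; i < BATCH; i++) {
    w0[i] = edge2(t.b, t.c, px[i], py);
    w1[i] = edge2(t.c, t.a, px[i], py);
    w2[i] = edge2(t.a, t.b, px[i], py);
  }

  /* Stage 3: coverage mask. */
  for (int i = 0; i < BATCH; i++)
    inside[i] = (w0[i] >= 0.0f) & (w1[i] >= 0.0f) & (w2[i] >= 0.0f);

  /* Stage 4: write the covered pixels. */
  uint32_t written = 0;
  for (int i = 0; i < BATCH; i++) {
    if (inside[i]) {
      row[x_start + i] = color;
      written++;
    }
  }
  return written;
}
```

Stages 1 to 3 have no branches and no dependencies between pixels, so they are the natural place to start with SIMD.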

#

the barycentric coordinate math would be a bunch of instructions based on how that's currently written

#

as would the mix

#

so I have to rewrite those

#

I think I should do this

#

I am going to test the vk buffer host visible mapping, maybe that's slow? I don't know

#

I'm also looking into hardware counters for profiling

#

specifically for my hardware

brisk chasm
#

QueryPerformanceCounter/Frequency

cloud rivet
#

I have that already

#

it's slow af

#

I need to sample

#

I think

#

I need to figure out how to do them periodically

#

I am currently looking at model specific registers for intel though

#

apparently my hardware sucks, Core i5-10400, I want all the new things

#

the AMX stuff sadcat

#

let me try this crazy vulkan buffer idea

#

maybe it is slow

#

I think I need to actually make a render graph node for the cpu rasterizer for this to work

#

since I have a per fif staging buffer

#

honestly doing the UI and cpu raster in a render graph makes a lot of sense thinkeyes

#

it is rendering

astral hinge
#

also, what GPUs do is hierarchically rasterize blocks of pixels

#

so you start by testing all the large squares of pixels the triangle's AABB touches, then narrow it down a few times

#

at the lowest level it might be 8x8 or something

#

which you can SIMD-ify pretty well
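
A sketch of the coarse step of that hierarchy: classify a square tile against the three triangle edges using its four corners. This is a generic conservative test written for a CPU rasterizer, not a description of what any particular GPU does. `Pt`, `tile_edge` and `classify_tile` are hypothetical names, and the triangle is assumed to have positive orientation under this edge function:

```c
#include <assert.h>

typedef struct { float x, y; } Pt;

typedef enum { TILE_OUTSIDE, TILE_PARTIAL, TILE_INSIDE } TileClass;

/* Twice the signed area of (a, b, p); negative means p is outside edge a->b. */
static float tile_edge(Pt a, Pt b, Pt p) {
  return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* Classify the square tile with top-left corner (x0, y0) and side `size`. */
static TileClass classify_tile(Pt v0, Pt v1, Pt v2, float x0, float y0, float size) {
  assert(size > 0.0f);
  Pt corners[4] = {
    {x0, y0}, {x0 + size, y0}, {x0, y0 + size}, {x0 + size, y0 + size}
  };
  Pt tri[3] = {v0, v1, v2};
  int all_inside = 1;
  for (int e = 0; e < 3; e++) {
    Pt a = tri[e], b = tri[(e + 1) % 3];
    int outside = 0;
    for (int c = 0; c < 4; c++)
      if (tile_edge(a, b, corners[c]) < 0.0f)
        outside++;
    if (outside == 4)
      return TILE_OUTSIDE; /* whole tile lies behind one edge: skip it */
    if (outside > 0)
      all_inside = 0;
  }
  /* Fully inside tiles can be filled without per-pixel tests. */
  return all_inside ? TILE_INSIDE : TILE_PARTIAL;
}
```

Only `TILE_PARTIAL` tiles need the per-pixel test (or another level of subdivision). The test is conservative, so a partial result near a vertex can still turn out to contain no covered pixels.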

cloud rivet
#

nice

astral hinge
#

well I guess the actual lowest level is 2x2 since those are quads of invocations. you might be interested in that for derivative calculations

#

you can even write your own "shaders" that are just callbacks

cloud rivet
#

well a shader I can write to be a sequential set of operations on a per-pixel basis that the gpu driver actually parallelizes all the instructions for, yes?

#

but on the cpu side I have to do this myself I think?

#

hrm

#

I see what you mean I think

astral hinge
#

yeah auto parallelization of shaders would be super hard

cloud rivet
#

I'm not sure what derivative calculations you are referring to

astral hinge
#

unless you write a vm or something

astral hinge
cloud rivet
#

right right

#

I want to use those for tangent generation also

#

I just vaguely know about it, I have to read through it to understand the actual math

astral hinge
#

the math is pretty frog_shrimplele

cloud rivet
#

nice

astral hinge
#

it's just a single subtraction
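
As a sketch of the "single subtraction": if a value is evaluated for all four pixels of a 2x2 quad, the screen-space derivatives fall out as differences between neighbors. The layout and the names `Quad`, `quad_ddx` and `quad_ddy` are assumptions for illustration, and this is the coarse variant that uses the top-left pixel as the reference:

```c
#include <assert.h>

/* Values for one 2x2 quad, in the order:
   v[0] = (x, y), v[1] = (x+1, y), v[2] = (x, y+1), v[3] = (x+1, y+1). */
typedef struct { float v[4]; } Quad;

/* Change per pixel in x: right neighbor minus left. */
static float quad_ddx(const Quad *q) {
  assert(q);
  return q->v[1] - q->v[0];
}

/* Change per pixel in y: lower neighbor minus upper. */
static float quad_ddy(const Quad *q) {
  assert(q);
  return q->v[2] - q->v[0];
}
```

For a value like `u = 0.25 * x + 0.5 * y`, the quad at the origin holds 0, 0.25, 0.5, 0.75, giving 0.25 in x and 0.5 in y.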

#

#opengl message

cloud rivet
#

yes that's right I have that in my notes from when we talked about it before

#

when I was looking at spirv stuff

astral hinge
#

yeah so your gpu code is automagically SIMD because the compiler needs to do that anyway

#

but on the cpu it's tricky because your "shader" won't run in blocks of 2x2 threads in lockstep

#

that's why I suggested a vm earlier

cloud rivet
#

yeah, I'm just going to avoid trying to replicate shaders where I work on one pixel at a time, and do CPU specific work in blocks I think

astral hinge
#

if the work unit is 2x2 then you can hand-write simd shaders if you want, or just do a loop over the pixels

cloud rivet
#

that would be a big perf win I think 2x2 simd

#

and not too hard to think about

#

as in the code won't be incredibly complex

#

I'm going to try that thanks

#

the call stack from another thread was a big no go btw

#

since you have to suspend threads for that to work

astral hinge
#

damn

#

I need to see what my profiler code was doing

cloud rivet
#

was that on linux?

#

it's different there

astral hinge
#

it was windows

cloud rivet
#

yes I am curious about your profiler code also

astral hinge
#

I have my school projects scattered around my drives so I need to search

cloud rivet
#

suspending a thread is horrific, if it is holding on to sync primitives for resources shared with other threads

astral hinge
#

I'm pretty sure I didn't do that

#

considering it was sampling I think thousands of times per second

#

which is also the frequency of the normal vs profiler

#

huh I can't find the course files anywhere on my pc

cloud rivet
#

was this for an undergraduate class?

astral hinge
#

yeah

cloud rivet
#

feels like maybe you had unintentionally produced phd dissertation level work

astral hinge
#

no it was an assignment that everyone had to do lol

#

it wasn't that hard

cloud rivet
#

was there a textbook associated with this task

#

I'd be interested in looking at it

echo crystal
cloud rivet
#

classwork on github is sus

astral hinge
#

I checked that too

#

well there actually was a github organization that I was in for the class, but I left it

#

I thought I'd have a repo though frog_think

cloud rivet
#

there's like some academic integrity problem with doing that imo

echo crystal
#

not if it's private 🤫

cloud rivet
#

just my opinion, but there may be uni policy also

astral hinge
#

we were supposed to join the org for the class

#

and it was private

#

somehow the way we uploaded our projects was private to other students

cloud rivet
#

maybe the class was just ok with the UB of doing it without suspending the thread

astral hinge
#

lol maybe

echo crystal
#

russian roulette profiling

cloud rivet
#

btw I got that 11" ipad air with a pencil and I love it

echo crystal
#

wow pretty interesting

cloud rivet
#

have you guys ever seen The Deer Hunter? great movie, except for the weird Russian roulette part

#

I like with the pencil how I can hover my pencil over a link and click the pencil to follow links

astral hinge
#

I wonder if all my files for that project were on a school drive

broken fog
broken fog
astral hinge
#

yeah it's just a fatter tree

broken fog
#

like more subdivisions per level?

broken fog
#

all my uni assignments are on gh and open source KEKW

#

unless the course specifically disallows it

astral hinge
#

I gave up looking for my projects for that class froge_sad

cloud rivet
#

thanks for looking

cloud rivet
broken fog
#

the latest air is m3 so if you bought new ye probably

#

mine is m1

#

honestly both are incredibly overkill for what most people do with an ipad KEKW

cloud rivet
#

yah m3

#

just checked

#

I'm just going to read pdfs with it

broken fog
#

yeah

cloud rivet
#

actually needs the m3 maybe for that? these pdfs are huge lol

broken fog
#

you know you have rt hardware in that thing right?

cloud rivet
#

yeah but I don't plan to write apps for gross locked down sandboxed mobile devices

broken fog
#

based

cloud rivet
#

when people tell me they're writing for android or browsers I just feel so sad about it

broken fog
#

browsers are fine tbh

cloud rivet
#

or if they're talking about it in one of the channels

broken fog
#

not the same as native dev but

cloud rivet
#

idk, not for me

broken fog
#

it's not a walled garden ecosystem like mobile

cloud rivet
#

it's still kinda shit imo but I understand people can find that interesting

broken fog
#

the web's actually kinda nice mostly because of how easy it is to ship something and share it with people

cloud rivet
#

I hate it so much lol sorry

broken fog
#

just send a url, no git clone, no build steps etc

#

but yeah i get why you'd hate it KEKW

cloud rivet
#

I think peak browsers was 1990's HTML

broken fog
#

false

#

peak websites tho

cloud rivet
#

I like being able to see images and videos like YT I guess

#

I wish I could just use lynx to browse the web

echo crystal
#

if u have adblock web is sooo much better

cloud rivet
#

yup of course

broken fog
#

the web platform itself is pretty cool

echo crystal
#

yeah

broken fog
#

if you want to build nice ui quick there's nothing better than html+css imo

echo crystal
#

probably some widget thing

#

not for me tho

cloud rivet
#

same

#

I mean for work it's fine

#

just in my personal time I don't want to do anything with it

broken fog
#

yeah that's fair enough

#

i don't feel like doing web shit on my free time either

cloud rivet
cloud rivet
#

I think work killed it in my soul

#

it's also so much work now to build a website, and so expensive. A k8s cluster costs hundreds of dollars per month

#

if you just want some static website, sure

#

you could build a serverless website using aws and step functions

#

yeah you could just run a LAMP stack or Django or whatever, heh

#

in 2025 👀

bronze socket
#

basic html + crappy used pc + home wifimaxxing

cloud rivet
#

:|

#

I don't run servers in my house

#

I guess App Engine or whatever is the thing to use maybe

#

iirc snapchat was on App Engine for a very long time

bronze socket
#

I have a little nuc I got off ebay but I just use it to host static files to stream to my friends mainly + a matrix server

cloud rivet
#

I think the last time I was excited about webdev was when Meteor was a thing

#

I used to go to the Meteor HQ for meetups every month

#

then react won

bronze socket
#

I don't think I have whatever it takes to be excited about web frameworks

cloud rivet
#

yeah it's jover for me too

bronze socket
#

or tech in general

cloud rivet
#

Meteor was so much fun

bronze socket
#

I've noticed the only people still excited about tech are the ones that know the least about it

cloud rivet
#

are you not including graphics programming as tech?

bronze socket
#

I think for the most part it counts

#

when it comes to new innovation

cloud rivet
#

all I want any more involves a C compiler

#

just moving my stuff to the render graph improved perf lol

#

frame time dropped by like 3ms and got like 5 more fps out of it

#

what it was before

#

gonna try passing in the staging buffer now

#

and removing the copy

#

I don't understand why that improved performance

#

how do computers work

bronze socket
#

I assume your frame graph is smarter and emits better barriers and whatnot?

cloud rivet
#

it doesn't emit any for any of this though

#

it does for vk transitions that need them

#

lol using the staging buffer works but it is dramatically slower

#

haha

cloud rivet
#

I exchanged a 1ms copy for an extra 10ms to raster

broken fog
#

just yeet it on vercel

#

their free plan is kinda nuts

cloud rivet
#

yeah vercel would be great actually

#

maybe I can configure this buffer differently

#
const VkBufferUsageFlagBits usage_flags = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
const VkMemoryPropertyFlags mem_properties
    = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
#

VK_MEMORY_PROPERTY_HOST_CACHED_BIT bit specifies that memory allocated with this type is cached
on the host. Host memory accesses to uncached memory are slower than to cached memory,
however uncached memory is always host coherent.

#

that also increased the copy to draw image time and both of the submit times by 1ms

#

it's nuts

#

I don't see any way to do this without it being slow

#

but

#

I removed VK_MEMORY_PROPERTY_HOST_COHERENT_BIT since I don't need that and went back to using my bitmap. I didn't get a win here either though

#

I did get a huge win from just shuffling my code around?

#

ok

#

I'm going to do the simd 2x2 quad stuff jaker brought up

#

I am regularly seeing 45 fps now

#

23ms frames

#

we are doing science in this thread

#

oh

#

also I had a bug with cpu raster and resize, and that's now fixed too by moving everything into the render graph

#

my render graph is fucking magical

#

it makes things faster and fixes bugs for me

#

emergent render graph behavior

bronze socket
#

kinda hard to reliably get timings on light computation loads anyway because all the crap happening in your OS is pure noise to the timings

cloud rivet
#

true

#

I am pretty certain that this change made it consistently faster

#

I think simd, MT and reduced scale and I'll be unblocked by performance issues on the CPU software raster progress for a bit

cloud rivet
#

lol

#

you know what

#

this whole week

#

I have been running with verbose validation enabled KEKW

cloud rivet
#

still going to do the simd idea though

#

and multithreading

#

I'm so close to finally doing 3D graphics

cloud rivet
#

oh reading intel's manual I realized I should be using fused multiply add for my dot product, all kinds of things, I could be using shuffle to set a whole bunch of color values at once based on barycentric coordinates, man

#

but even without avx I should be using fma

#

for better precision

#

as it rounds less

cloud rivet
#

I like reading the manual on my ipad on the bus, I also am reading the tinyrenderer lectures for software rasterization

#

I am pretty happy with the tech on this project. I think I have made major strides since Rosy

#

soon I will have nicer pixels

#

I'm serious about ripping out shaders and generating SPIRV with application code. I think in my UI I will make a SPIRV disassembly viewer for debugging. That's a ways off though, I want to get some 3D and lighting going in the CPU software renderer

#

also an IR view

#

maybe an IR view and I can click on or hover over the IR and see the SPIRV dis it will generate

#

I like how I added a string to my render graph nodes to document their intent when debugging. I will do that with the IR too

astral hinge
#

@cloud rivet I found the code with the help of a former classmate of mine. sadly it looks like it just suspends the thread and resumes it for every sample

#

somehow it gets 1000Hz resolution though, so maybe suspension isn't that expensive

#

here's an example of the output it'd generate when you call GenOutput()

cloud rivet
#

suspending a thread is very shitty if you have a mutex for example

#

in that thread and other threads

#

maybe things can deadlock

astral hinge
#

hmm I can see how it could slow down other threads, but not how it could cause a deadlock

astral hinge
#

does that mean checking the state of a thread and only sampling the program counter when it's suspended?

#

you'd also have to make sure it doesn't wake up while you're doing that, right?

cloud rivet
# astral hinge does that mean checking the state of a thread and only sampling the program coun...
#

seems to be an issue with using CriticalSection apis and other things

#

oh

#

replied to the wrong message

cloud rivet
astral hinge
#

oh I see

#

that's smart

cloud rivet
#

The problem is that "printf" internally uses a CriticalSection;

#

haha

#

that sucks

astral hinge
#

I see yeah

#

I mean it should be ok if your profiler thread doesn't do anything that takes a lock though

#

maybe GetThreadContext acquires a lock

cloud rivet
#

the recommendation is to run stackwalk in a critical section based on the docs 😅

#

it's a cool thing to have

#

I may try it thank you for the codes

#

I will try it once I have a checkbox

#

then I can conditionally turn that on

cloud rivet
#

I wonder if the compiler already just optimizes some code to FMA

broken fog
#

probably does

#

iirc clang did on arm

#

compiler explorer is your friend

cloud rivet
#

ya

#

let me copy my dot product into that and see what it does

astral hinge
#

is it allowed if you don't have fast math?

cloud rivet
#

let me google this fast math thing

#

I saw in the intel manual actually

#

let me actually look at clang compiler args

#

-ffast-math

#

I don't see it there either

#

I don't understand

astral hinge
cloud rivet
#

128-bit Legacy SSE version: The first source and destination operands are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.

#

the registers it uses are SSE

#
MOVHLPS help

This instruction cannot be used for memory to register moves.

128-bit two-argument form:

Moves two packed single-precision floating-point values from the high quadword of the second XMM argument (second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

128-bit and EVEX three-argument form
#

I'm going to compile with -ffast-math

#

hrm

#

running with similar frame time

#

using -ffast-math without optimizations just produces the same result as no args

#

Enable fast-math mode. This option lets the compiler make aggressive, potentially-lossy assumptions about floating-point math. These include:

Floating-point math obeys regular algebraic rules for real numbers (e.g. + and * are associative, x/y == x * (1/y), and (a + b) * c == a * c + b * c),

No NaN or infinite values will be operands or results of floating-point operations,

+0 and -0 may be treated as interchangeable.

-ffast-math also defines the __FAST_MATH__ preprocessor macro. Some math libraries recognize this macro and change their behavior. With the exception of -ffp-contract=fast, using any of the options below to disable any of the individual optimizations in -ffast-math will cause __FAST_MATH__ to no longer be set. -ffast-math enables -fcx-limited-range.

This option implies:

-fno-honor-infinities
-fno-honor-nans
-fapprox-func
-fno-math-errno
-ffinite-math-only
-fassociative-math
-freciprocal-math
-fno-signed-zeros
-fno-trapping-math
-fno-rounding-math
-ffp-contract=fast

Note: -ffast-math causes crtfastmath.o to be linked with code unless -shared or -mno-daz-ftz is present. See A note about crtfastmath.o for more details.

#
+ and * are associative
#

lol what

#

when is that not the case

#

oh maybe this is floating point spec stuff

#

-ffast-math also defines the __FAST_MATH__ preprocessor macro. Some math libraries recognize this macro and change their behavior.

#

man

#

I am learning so much

#

the std library checks this

#
if the macro constants FP_FAST_FMA, FP_FAST_FMAF, or FP_FAST_FMAL are defined, the function std::fma evaluates faster (in addition to being more precise) than the expression x * y + z for double, float, and long double arguments, respectively. If defined, these macros evaluate to integer 1.
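
A small sketch of how application code could use that macro, assuming a hypothetical helper named `mul_add`. `FP_FAST_FMAF` is the C99 macro from `<math.h>` for the `float` case. When it is not defined the helper falls back to the ordinary expression, on the assumption that a software `fmaf` would be slower there.

```c
#include <math.h>

/* Hypothetical helper: use fmaf only where the platform says it is fast. */
float mul_add(float x, float y, float z)
{
#ifdef FP_FAST_FMAF
    return fmaf(x, y, z);   /* single rounding, expected to be a hardware FMA */
#else
    return x * y + z;       /* ordinary multiply then add */
#endif
}
```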
#

well

#

it checks other things

#

why am I linking to cpp

astral hinge
#

because your subconscious yearns for longer compile times

astral hinge
#

changing the order can affect the result
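
A tiny sketch of why `+` is not associative for floats, using made-up helper names. With `a = 1e20f`, `b = -1e20f`, `c = 1.0f`, adding `a + b` first cancels exactly and keeps the `1.0f`, while adding `b + c` first loses the `1.0f` to rounding. This assumes the code is built without `-ffast-math`, since that flag is exactly what allows the compiler to pick either order.

```c
/* Same three operands, two groupings. */
float sum_left(float a, float b, float c)
{
    return (a + b) + c;   /* (1e20 + -1e20) + 1 keeps the 1 */
}

float sum_right(float a, float b, float c)
{
    return a + (b + c);   /* -1e20 + 1 rounds back to -1e20, so the 1 is lost */
}
```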

cloud rivet
#

this is with fma

#

CVTSS2SD help

Converts a single-precision floating-point value in the “convert-from” source operand to a double-precision floating-point value in the destination operand. When the “convert-from” source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register. The result is stored in the low quadword of the destination operand.

128-bit Legacy SSE version: The “convert-from” source operand (the second operand) is an XMM register or memory location. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged. The destination operand is an XMM register.

VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers. Bits (127:64) of the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

Software should ensure VCVTSS2SD is encoded with VEX.L=0. Encoding VCVTSS2SD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

#

call fma@PLT does actually not appear in the -ffast-math version

#

if I am going just by counting the number of instructions

#

to determine speed

#

not using fma would be faster

#

but

#

it's probably not that simple

#

fma is more about avoiding loss of precision I think?

#

so fma is orthogonal to ffast-math ?

#

k anyway

broken fog
#

x86 instructions don't tell you a lot more than c code

#

who knows what the cpu is actually doing at the hw level

cloud rivet
#

stupid out of order execution

broken fog
#

and it probably changes for every uarch

cloud rivet
#

jk, that makes things go faster

broken fog
cloud rivet
#

I was reading the intel manual today and the first chapter is about the history of intel processors

#

so it had the 286/386/486 etc listed and what the innovations were for each architecture

#

it was fun to read about

broken fog
#

huh

cloud rivet
#

I read like 5 chapters of that thing

broken fog
#

idk when ooo was introduced

cloud rivet
#

it tells you in there

broken fog
#

iirc 386 didn't have it

cloud rivet
#

it did not

#

it's much later

broken fog
#

pentium something?

cloud rivet
#

it's like in the aughts

broken fog
#

oh

cloud rivet
#

anyway

#

it's in the pdf

broken fog
#

netburst possibly

#

or p3

cloud rivet
#

P6

#

1995

#

I was off

#

The centerpiece of the P6 processor microarchitecture is an out-of-order execution mechanism called dynamic
execution.

#

hrm

#

converting all my raster math to operate on float4[4]'s

#

will start with just hand doing what I already do

#

just operating at 4 float4s at a time

#

then change the raster code to use that

#

and then figure out how to do that with simd

#

that way by the time I get to simd I know the rest of it works and it's just getting the simd to work

cloud rivet
#

man adding -ffast-math actually makes things slower

#

wtf

#

the first f is for fail

astral hinge
#

you shouldn't use it anyway because it breaks a bunch of stuff

cloud rivet
#

what is it good for then

#

that makes sense though

astral hinge
#

it's probably only useful when applied at the function level

cloud rivet
#

I get what it's breaking sort of with floating point implementation

astral hinge
#

applying it to the whole application is just asking for trouble

cloud rivet
#

hrm

#

how do you apply a compiler arg to just a function

astral hinge
#

idk

#

but you can do it to just files

cloud rivet
#

not with unity builds lol

astral hinge
#

maybe there are pragmas that enable/disable scopes of fast math

#

which compiler are you using?

cloud rivet
#

clang 20

#

for the vector language extensions

#

and because idk I can go read the source code for it I guess

astral hinge
#

clang has #pragma float_control

cloud rivet
#

When pragma float_control(precise, on) is enabled, the section of code governed by the pragma uses precise floating point semantics, effectively -ffast-math is disabled and -ffp-contract=on (fused multiply add) is enabled. This pragma enables -fmath-errno.
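
A hedged sketch of scoping relaxed floating-point semantics to one function with that pragma. The function names are hypothetical. The pragma is Clang-specific and is placed at the start of the function body so it only covers that block; other compilers are expected to ignore it, possibly with a warning. The exact set of optimizations it turns on is whatever the Clang docs say for `precise, off`, not something this example verifies.

```c
/* Relaxed semantics for this one function only (Clang-specific pragma). */
float sum4_relaxed(const float v[4])
{
#pragma float_control(precise, off)
    return v[0] + v[1] + v[2] + v[3];
}

/* Default (precise) semantics: evaluated strictly left to right. */
float sum4_precise(const float v[4])
{
    return ((v[0] + v[1]) + v[2]) + v[3];
}
```

This also works in a unity build, since the setting follows the function rather than the file.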

#

ah

#

so disable it by default

#

hrm

#

interesting

#

I don't want fast math

#

I hate it

#

I didn't know it existed, but now I do, and I regret it

#

jk

#

I'm sure it has its uses whatever those are

#

big brain stuff

#

rewriting things to be float4[4]s is forcing me to fix my horrific raster code

#

which was just sort of written to work as I figured it out

#

so that's good

#

as long as I don't make anything slower

#

i was writing this garbage when I was recording my 1 hour videos

vagrant musk
cloud rivet
#

why do you think that?

vagrant musk
#

frogshrug I don’t do much at a hardware level, but I’m pretty sure floating point math is pretty fast nowadays

#

I also don’t know exactly how fast math works, but I for some reason want to assume it’s doing something in software to avoid some check?

cloud rivet
#

I think ffast-math is about taking shortcuts

astral hinge
#

ffast-math relaxes certain rules and allows the compiler to make more assumptions

cloud rivet
#

and not old hardware imo

#

this work is hard

#

hrmm

#

might be faster to just rewrite the rasterizer

#

idk idk

#

I'm gonna keep going

cloud rivet
#

why is this making my rasterizer even faster

#

I haven't even done any simd yet

#

I was like at 25 fps 2 days ago

#

and I'm at 80 now

#

my frame time dropped like 4ms from just migrating halfway to simd

astral hinge
cloud rivet
#

I migrated my functions like this

#
float4 cross(float4 a, float4 b);
f32 dot(float4 a, float4 b);

void cross4x4(float4 a[4], float4 b[4], float4 out[4]);
void dot4x4(float4 a[4], float4 b[4], float4 *out);
#
float4 barycentric_coords(float4 pos, p_triangle t);
void barycentric_coords4x4(float4 pos[4], p_triangle t, float4 bc[4]);
#

but

#

I'm just passing in 1 thing

#

for each of those float4 [4]

#

I'm just setting the [0] value right now

astral hinge
#

the compiler is probably auto-simd'ing it

#

autovectorization
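
A rough sketch of what a batch-of-four dot product can look like when written as a plain loop that the compiler is free to vectorize. The `float4` struct here is a hypothetical stand-in (the thread's real type is a Clang vector extension), and `out` is a plain `float[4]` rather than the `float4 *out` in the signature above, purely to keep the example portable.

```c
/* Hypothetical stand-in for the project's float4. */
typedef struct { float x, y, z, w; } float4;

float dot(float4 a, float4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Four independent dot products with a fixed trip count and no
   cross-iteration dependency, which is the shape autovectorizers like. */
void dot4x4(const float4 a[4], const float4 b[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = dot(a[i], b[i]);
}
```

No intrinsics appear anywhere; whether SIMD instructions are actually emitted depends on the optimization level and target flags, which compiler explorer can confirm.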

cloud rivet
#

simd for free

astral hinge
#

what's the 4x4 naming convention btw

#

for cross and dot it makes dense because there are four of each operand

#

sense*

#

but for barycentric there's still just one triangle

cloud rivet
#

it's just like a marker that made sense with those and I kept using it

astral hinge
#

btw you can possibly get more perf by making the arguments restrict

cloud rivet
#

hrm

#

I will put this in my notes, I don't understand what that page is talking about right now

#

thanks

#

I don't want to cargo cult it, I want to thoroughly understand what that means

#

looks super interesting

#

allows the compiler to optimize, but I have to read it carefully

astral hinge
#

putting restrict on a pointer means that a write through another pointer won't affect anything pointed to by the first

#

so it means the compiler can assume writes through some other pointer won't change the restrict data, allowing it to possibly emit fewer loads

cloud rivet
#

what does writes through mean?

astral hinge
#

just dereferencing and storing *foo = 42;

cloud rivet
#

so it's a promise that doing *foo = bar; where bar is a pointer won't change bar?

#

oh wait

#

I get it

#

no I don't get it

astral hinge
#

I'll illustrate with a function

#
void foo(int* a, int* b, int* c)
{
  c[0] = a[0] * b[0];
  c[1] = a[0] * b[0];
}
#

so here the compiler must assume that a, b, and c can alias each other (point to overlapping regions of memory)

cloud rivet
#

oh

#

I have heard about this before

astral hinge
#

so when c[0] is written to, it must assume that a[0] and/or b[0] could have changed (because I use them twice)

cloud rivet
#

oh thanks for explaining this

#

yes I think I ran into this with zig before

astral hinge
#

if a and b are restrict then the compiler no longer has to assume that anything written aliases them
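
The explanation above can be sketched as two versions of the same function. `foo_may_alias` mirrors the `foo` from the thread and `foo_restrict` is a hypothetical variant with the qualifiers added. Calling the first one with `a` and `c` pointing at the same memory shows why the compiler has to reload `a[0]` after the first store.

```c
/* Without restrict: c may overlap a or b, so a[0] and b[0] must be
   re-read after the store to c[0]. */
void foo_may_alias(int *a, int *b, int *c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
}

/* With restrict: the caller promises the three pointers do not overlap,
   so the product can be computed once and stored twice. */
void foo_restrict(const int *restrict a, const int *restrict b, int *restrict c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
}
```

Passing overlapping pointers to `foo_restrict` would break the promise and is undefined behavior, so the aliasing call is only legal with the first version.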

cloud rivet
#

that makes sense, thanks!

#

back when I was learning zig I read through the entire spec all the time

#

and there are all these builtins

#

that are I guess kind of like qualifiers in C

#

and I encountered that I think

astral hinge
#

also you can use restrict on array parameters like this
void foo(int array[restrict 4]);

#

you can also put static inside the [] to assert that the pointer points to an array with at least 4 elements

cloud rivet
#

this is a C only thing?

astral hinge
#

yeah

#

without static, it's just a normal pointer parameter disguised as an array

cloud rivet
#

is this a new use of the word static?

astral hinge
#

yeah

cloud rivet
#

that thing means so many things

astral hinge
#

In each function call to a function where an array parameter uses the keyword static between [ and ], the value of the actual parameter must be a valid pointer to the first element of an array with at least as many elements as specified by expression:

cloud rivet
#

oh

#

that's cool

astral hinge
#

I think it's worth trying in your perf-sensitive math functions. it probably helps with autovectorization

cloud rivet
#

I should be using that everywhere

#

thanks

#

you know a lot

astral hinge
#

idk what static could help with

#

oh it implicitly asserts the pointer is non-null

#

using [static 1] to document/assert non-null arguments is interesting
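
A short sketch of both array-parameter forms together, with hypothetical function names. This is C only (C99 and later); C++ does not accept `static` or `restrict` inside the brackets.

```c
/* [static 4]: the caller must pass a non-null pointer to at least 4 floats. */
float sum4(const float v[static 4])
{
    return v[0] + v[1] + v[2] + v[3];
}

/* [restrict static 4]: additionally promises out and in do not overlap. */
void scale4(float out[restrict static 4], const float in[restrict static 4], float k)
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * k;
}
```

Compilers may warn when a call site visibly passes a null pointer or a too-small array, but nothing is checked at runtime.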

#

ugly syntax though

cloud rivet
#

ya

#

my debug build is getting slower and my release build is getting faster

#

actually

#

I can't really tell on the debug

#

because it's such a tiny number

#

it probably doesn't mean anything

#

it's been hovering around the same values tbh

cloud rivet
#

I haven't touched the triangle yet, that's just the background

#

_mm_storeu_si128 is so fast

cloud rivet
#

I think I've been benefiting mostly accidentally from compiler optimizations as a result of changing my code and adding the compiler flags to enable simd

#

so after moving the rest of the rasterizer to the 4x4 SIMD, I have a plan for MT

#

I am thinking I will have 4 threads that each individually iterate over 1 of the quadrants of all 4x4 quads in a surface in NDC

#

and have them all run down each through the same surface

#

what I like about this is there's no overlap in where each thread will write to
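
One possible reading of that plan, sketched without any threading API. It assumes each 8x8 pixel block is split into four 4x4 tiles and that worker `i` owns tile `i` of every block; the names `tile_owner` and `fill_owned_tiles` are hypothetical, and the surface dimensions are assumed to be multiples of 4. Each real thread would run `fill_owned_tiles` with its own worker index.

```c
#include <stdint.h>

enum { TILE = 4 };

/* Owner of the 4x4 tile at tile coordinates (tx, ty): its position
   inside the enclosing 2x2 group of tiles (one 8x8 pixel block). */
unsigned tile_owner(unsigned tx, unsigned ty)
{
    return (tx & 1u) | ((ty & 1u) << 1);
}

/* Work for one worker: visit only its own tiles and stamp worker + 1
   into every pixel, so ownership is visible in the buffer. */
void fill_owned_tiles(uint8_t *pixels, unsigned width, unsigned height, unsigned worker)
{
    for (unsigned ty = 0; ty < height / TILE; ++ty) {
        for (unsigned tx = 0; tx < width / TILE; ++tx) {
            if (tile_owner(tx, ty) != worker)
                continue;
            for (unsigned y = 0; y < TILE; ++y)
                for (unsigned x = 0; x < TILE; ++x)
                    pixels[(ty * TILE + y) * width + (tx * TILE + x)] = (uint8_t)(worker + 1);
        }
    }
}
```

Because every pixel belongs to exactly one tile and every tile to exactly one worker, no two workers ever write the same byte, which matches the no-overlap property described above.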

#

I'll work down the bounds of a triangle and do tests on the 4 corners, i.e. for an 8x8 quad, to see whether to even bother with doing any work, and each 4x4 will also do a test

#

I need to keep everything aligned, the bitmap itself is the size of the full screen no matter how big the window is so I think it is safe to overwrite

#

on arbitrary window sizes

#

the in window UI windows are a bit trickier

broken fog
cloud rivet
#

just stick stuff into simd registers, do ops

#

idk

#

it's magic

#

after I'm done reading the intel manual I think I'm going to read the latest C spec, and after that maybe I'll read about clang, since the optimizations it is doing are so dramatically impactful I should learn more?

#

I'm kind of worried that by doing the simd myself I'll actually make shit slower

#

because maybe the compiler is better at autovectorizing than I am at figuring out what I should do when

#

I'll make a change and see how it goes

#

you can see my raster triangle frame time dropped in that screenshot I hadn't even made any changes to it, same with the render text

#

like all I did I think was add the compiler flags

#

render text dropped from 200 microseconds to 40

cloud rivet
#

that manual prints nicely as a pdf

#

oh there's a -fproc-stat-report so I can just have the compiler print how long it took instead of me trying to measure with a cli tool

echo crystal
#

i think you should get model loading before optimising it too much

#

to get more "real" work

cloud rivet
#

I have to write a model importer lol

#

I will just write an obj importer to start with

#

I can’t support gltf until I write a json parser

bronze socket
#

you should watch the simdjson talks if you really want a rabbithole

cloud rivet
#

I kind of need a png thing too

#

also I need a way to do matrix math

#

I just don't have anything yet

echo crystal
#

the pain of having to diy frogsippy

cloud rivet
#

it's fun

#

I have 10k lines of single triangle rendering

#

oh my mesh shader has three triangles, that's right

echo crystal
#

nice

cloud rivet
#

it's a triangle medley

cloud rivet
astral hinge
cloud rivet
#

I don't understand that site at all

#

that's like benchmarking data?

#

I'm just trying to understand what ops to use 😅

#

that's pretty hard

astral hinge
#

since x86 instructions are basically high level and are implemented with micro ops

#

so they have different perf characteristics on different arches

cloud rivet
#

I think I just don't know enough to make any sense out of the information there

astral hinge
#

so I guess that site is more like an alternative to Agner Fog's instruction tables

#

but I mean you can explore and read about different instructions without worrying about how they perform on different arches

#

same as the Intel intrinsics guide

#

well the Intel guide also has the perf info, but only for their arches

#

uops also has AMD arches

cloud rivet
#

I see

cloud rivet
#

you know what I did

#

I removed all my hand crafted simd

#

and everything got faster

broken fog
#

average simd experience

cloud rivet
#

I'm too ignorant

#

the auto vectorization is way better

#

so just rewriting my code

broken fog
cloud rivet
#

and adding -mf16c

broken fog
cloud rivet
#

it's those two things I get 100fps with just -mf16c and then another 100fps with my reorg

#

I went back to my code before I reorganized it and just used -mf16c and it went from 70fps to over 100

#

with both I get 200-300fps

#

I am actually undoing more of this code to see if I get it even faster, the simd's removed but I did some weird stuff to get simd

cloud rivet
#

you know what I realized

#

I don't need to test barycentric coordinates every pixel

#

just along the edges do I have to test it for every pixel

#

of the triangle

#

anyway

#

I agree with dodo

#

I think I should start like rendering actual things

cloud rivet
#

I just accidentally keep falling into better perf

broken fog
#

one thing i did find out working on my last sw rasterizer is you can calculate the barycentric coords incrementally and save a whole bunch of ops

#

didn't go much further than that tho cause i wrote the whole thing in like an afternoon
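
A minimal sketch of the incremental idea, not the code from either rasterizer in this thread. It uses integer edge functions (the unnormalized barycentric weights) so the results are exact; the type and function names are hypothetical. Each edge function is linear in `x` and `y`, so stepping one pixel right or down only needs an add instead of a full re-evaluation.

```c
#include <stdint.h>

typedef struct { int32_t x, y; } point2i;   /* hypothetical pixel coordinate */

/* Edge function: twice the signed area of triangle (a, b, p). */
int32_t edge_fn(point2i a, point2i b, point2i p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* Reference: evaluate all three edge functions from scratch per pixel. */
int count_inside_direct(point2i v0, point2i v1, point2i v2, int w, int h)
{
    int n = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            point2i p = { x, y };
            if (edge_fn(v1, v2, p) >= 0 && edge_fn(v2, v0, p) >= 0 && edge_fn(v0, v1, p) >= 0)
                ++n;
        }
    return n;
}

/* Incremental: evaluate once at (0, 0), then add constant deltas. */
int count_inside_incremental(point2i v0, point2i v1, point2i v2, int w, int h)
{
    const point2i origin = { 0, 0 };
    int32_t row0 = edge_fn(v1, v2, origin);
    int32_t row1 = edge_fn(v2, v0, origin);
    int32_t row2 = edge_fn(v0, v1, origin);
    /* Change in each edge function per +1 step in x and in y. */
    const int32_t dx0 = -(v2.y - v1.y), dy0 = v2.x - v1.x;
    const int32_t dx1 = -(v0.y - v2.y), dy1 = v0.x - v2.x;
    const int32_t dx2 = -(v1.y - v0.y), dy2 = v1.x - v0.x;
    int n = 0;
    for (int y = 0; y < h; ++y) {
        int32_t w0 = row0, w1 = row1, w2 = row2;
        for (int x = 0; x < w; ++x) {
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                ++n;
            w0 += dx0; w1 += dx1; w2 += dx2;
        }
        row0 += dy0; row1 += dy1; row2 += dy2;
    }
    return n;
}
```

Dividing `w0`, `w1`, `w2` by `edge_fn(v0, v1, v2)` gives the normalized barycentric coordinates needed for interpolation. The sketch assumes one winding order, so a triangle with the opposite winding would need the sign test flipped.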

cloud rivet
#

Yeah actually i need the barycentric coordinates for interpolation

#

So it’s more about outside the triangle I guess

#

You wrote a thing that works in an afternoon though, that’s amazing

#

Incremental is interesting

cloud rivet
#

The lesson learned the last few days is write vector code in a way to enable the compiler to batch work without trying to tell it how

#

And to use F16C when using half types

broken fog
#

but yea

#

it barely works but hey it does work

elfin cape
#

@cloud rivet I forgot to respond to you about the rad linker. For my own project I haven't used it since it's okayish, but I would like to use it at work, except the whole codebase and CI is such a shit show that it's not possible.

#

I plan to use it in the next project thats in the works but I dont have that much free time...

cloud rivet
#

nice

#

I don't link anything other than win32 and the vulkan dll so I haven't looked into it, and I run a unity build

#

I don't really have any linking problems to solve

#

if it's faster that's cool

elfin cape
cloud rivet
#

those are good things to link

#

what is daxa

elfin cape
#

vulkan abstraction that I use made by lpotrick, saky gabe rundlett

#

its really nice

cloud rivet
#

is that used instead of making vk api calls directly?

elfin cape
#

yes

cloud rivet
#

neat

elfin cape
#

there is bunch of macros to share code between shaders and C++

elfin cape
#

sol2 was the most expensive. Just removing it lowered the compile time from 20s to 10s

#

thats on my 7950X

cloud rivet
#

10s compile is not horrible

elfin cape
#

it could be much better. I do some things that are not optimal. I hope the next project is going to be much better. We are using interfaces and forward declaring as much as possible

cloud rivet
#

I wish C++ modules were supported better

#

it seems like they aren't really usable?

elfin cape
#

the support is okay but compile time is 5x worse

#

compared to headers...

cloud rivet
#

uh

#

my unity build has been going well so far

#

it's just a hobby project

#

not a real thing

elfin cape
#

You get benefit of C compile times

cloud rivet
#

yeah

elfin cape
#

no template explosion, etc...

cloud rivet
#

compiles under 1 second still

#

I honestly never want to write in anything but C at this point, although I do sometimes miss zig

elfin cape
#

I deal with that at work and its not fun at all. I deal with 3 hours compile time. I had to compile today twice...

cloud rivet
#

my work's go monolith takes like 10 minutes to compile, but 3 hours jfc

elfin cape
#

the worst thing is this isn't a monolith, that's just one product...

#

I am really happy about getting rid of boost that should cut down the compile times down a lot.

echo crystal
#

what linker ?

elfin cape
#

rad linker

#

the reason why fortnite is listed there is because they reached the limit of symbols in pdbs KEKW

echo crystal
#

does it work on loonix

elfin cape
#

on linux you have mold

echo crystal
#

i think rad debugger doesn't

elfin cape