#Rosy
That was an example but as I am watching the video. It might be a 4k screen 
>wmic desktopmonitor get Caption, MonitorType, ScreenHeight, ScreenWidth
Caption MonitorType ScreenHeight ScreenWidth
Generic PnP Monitor Generic PnP Monitor 2160 3840
Default Monitor Default Monitor 1440 2560
Default Monitor Default Monitor
3840x2160 is the display I was recording
I'm going to draw the windows first and track those regions to avoid rendering the bg on
if I don't draw the bg or the triangle I get double the perf
if I only draw fps
it's the bg and something totally unrelated to drawing
I just need to start profiling
a mostly blank screen getting 77 fps is pretty ridiculous
this is my first C program
do you have a 144p monitor
144hz
not 144p lmao
my guess is that you have vsync on and it's locking to 77
I did change my driver from studio to the game driver
maybe I did something there
yeah if I run rosy I can get way more
this is just my application being slow
I will start on a profiler using the Windows event tracing and performance tools
it'll be fun
just saying, they are going to remove wmic with the next win11 release 25h2
yeah planning on using win32 event tracing and diagnostic apis
those aren't deprecated
I will do what tracy does and use macros to trace in my application that are noop unless I define a #define profiling macro
and how I think I'll make this work is that when profiling is enabled I'll add a keyboard shortcut and it will profile for a short bit and dump a text file when it's done
I'll start on that today
my goal is to just start with a frame marker, then add support for zones, and then markers, figure out what's slow and maybe add memory profiling and just figure it out
I'll just work on it a bit at a time as things are slow and I need more profiling capabilities
to resolve slowness
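A minimal sketch of the "macros that are noop unless I define a profiling macro" idea, assuming nothing about the real trace context. All names here (`PROFILING`, `ZONE_BEGIN`, `ZONE_END`, `g_zones`) are hypothetical, and `clock()` is only a portable stand-in for `QueryPerformanceCounter`:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical names; the real code uses start_tr/end_tr on the app context. */
#define PROFILING 1 /* remove this line to compile every zone away */
#define MAX_ZONES 64

typedef struct {
    const char *name;
    clock_t start;  /* timestamp of the most recent ZONE_BEGIN */
    double seconds; /* time accumulated across all begin/end pairs */
    uint32_t hits;  /* number of completed begin/end pairs */
} Zone;

static Zone g_zones[MAX_ZONES];

static void zone_begin(uint32_t id, const char *name) {
    assert(id < MAX_ZONES);
    g_zones[id].name = name;
    g_zones[id].start = clock(); /* stand-in for QueryPerformanceCounter */
}

static void zone_end(uint32_t id) {
    assert(id < MAX_ZONES);
    g_zones[id].seconds += (double)(clock() - g_zones[id].start) / CLOCKS_PER_SEC;
    g_zones[id].hits++;
}

#ifdef PROFILING
#define ZONE_BEGIN(id, name) zone_begin((id), (name))
#define ZONE_END(id) zone_end((id))
#else
/* Expand to an empty expression so call sites cost nothing. */
#define ZONE_BEGIN(id, name) ((void)0)
#define ZONE_END(id) ((void)0)
#endif
```

Call sites stay the same in both builds: `ZONE_BEGIN(0, "frame"); ... ZONE_END(0);`. Dumping `g_zones` to a text file on a keyboard shortcut would then be a separate step.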
I need to add a loop sampling thingy I guess
I'm also missing something
there's a lot of time spent somewhere idk what it is
ah
it's the render graph
that feels really slow
void start_tr(AppContext *actx, Arena *arena, u32 trace_id) {
if (!actx)
fatal("no actx");
if (!arena)
fatal("no arena");
if (!actx->trctx)
fatal("no trctx");
if (actx->trctx->num_traces <= trace_id)
fatal("traces overflow");
QueryPerformanceCounter(&actx->trctx->traces[trace_id].starting_time);
}
void end_tr(AppContext *actx, Arena *arena, u32 trace_id) {
if (!actx)
fatal("no actx");
if (!arena)
fatal("no arena");
if (!actx->trctx)
fatal("no trctx");
if (actx->trctx->num_traces <= trace_id)
fatal("traces overflow");
QueryPerformanceCounter(&actx->trctx->traces[trace_id].ending_time);
u64 duration
= actx->trctx->traces[trace_id].ending_time.QuadPart - actx->trctx->traces[trace_id].starting_time.QuadPart;
duration *= 1'000'000;
duration /= actx->trctx->frame_frequency.QuadPart;
actx->trctx->traces[trace_id].duration = (f64)duration;
actx->trctx->traces[trace_id].duration /= 1'000'000;
}
whatever this is
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
//
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second. We use these values
// to convert to the number of elapsed microseconds.
// To guard against loss-of-precision, we convert
// to microseconds *before* dividing by ticks-per-second.
//
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
aren't you supposed to multiply the result of QPC by the frequency of the cpu or something
duration /= actx->trctx->frame_frequency.QuadPart;
ah
so I think its ms
18 us for cpu raster seems quite fast
sorry
idk what it's rasterizing though
ah
many opportunities for improvement then
ya
oh man it should've been obvious when it said frame=0.033 lol
lol
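For reference, the unit confusion above comes down to this arithmetic. This is a sketch with hypothetical helper names, and the 10 MHz tick rate in the usage note is just an assumed example value for what `QueryPerformanceFrequency` might return:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick delta to whole microseconds. Multiply before dividing so
   the integer division does not throw away the sub-second part. */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t ticks_per_second) {
    assert(ticks_per_second != 0);
    return ticks * 1000000u / ticks_per_second;
}

/* Dividing microseconds by 1e6 gives seconds, not milliseconds. */
static double us_to_seconds(uint64_t us) { return (double)us / 1e6; }

/* Dividing microseconds by 1e3 gives milliseconds. */
static double us_to_ms(uint64_t us) { return (double)us / 1e3; }
```

With an assumed 10 MHz counter, 330000 ticks is 33000 us, which is 0.033 when divided by 1'000'000 (seconds) and 33 when divided by 1000 (milliseconds). So a value like frame=0.033 from `end_tr` is in seconds.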
the render graph is really slow
the cpu raster being slow is whatever, I knew that would be slow
now that you have a profiling thingy, time to spam it
ya
I think you should make a sampling profiler too just to get a quick overview of stuff without having to instrument
Well there is the win32 function(s) for getting the stack pointer and then the call stack from it, which you can call from another thread (so it costs basically nothing on the main thread)
Idk what the function is though so you'd have to do a little research
Anyway, you can put the call stack info into a map and then do a little math to figure out how long you spent in each place
oh I see
You take a sample of the call stack every millisecond or so
ohhhh
That's how the profiler works in visual studio
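The "put the samples into a map and do a little math" part could look roughly like this. It is only a sketch of the bookkeeping: the sampling itself (getting an address from the other thread) is left out, a linear-scan table stands in for a real map, and every name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SITES 256

typedef struct {
    uintptr_t addr;   /* sampled program counter or function address */
    uint32_t samples; /* how many times a sample landed here */
} Site;

typedef struct {
    Site sites[MAX_SITES];
    uint32_t num_sites;
    uint32_t total_samples;
    double interval_s; /* time between samples, e.g. 0.001 for 1 ms */
} SampleTable;

/* Count one sample for addr, adding a new entry the first time it is seen. */
static void record_sample(SampleTable *t, uintptr_t addr) {
    assert(t);
    t->total_samples++;
    for (uint32_t i = 0; i < t->num_sites; i++) {
        if (t->sites[i].addr == addr) {
            t->sites[i].samples++;
            return;
        }
    }
    if (t->num_sites < MAX_SITES) {
        t->sites[t->num_sites].addr = addr;
        t->sites[t->num_sites].samples = 1;
        t->num_sites++;
    }
}

/* Each sample stands for roughly one interval of time spent at that address. */
static double estimated_seconds(const SampleTable *t, uintptr_t addr) {
    assert(t);
    for (uint32_t i = 0; i < t->num_sites; i++) {
        if (t->sites[i].addr == addr)
            return (double)t->sites[i].samples * t->interval_s;
    }
    return 0.0;
}
```

The estimate is statistical: with a 1 ms interval, three samples at one address means roughly 3 ms there, and anything much shorter than the interval can be missed entirely.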
It'll be so cool to have your own suite of tooling
If something lacks you can just improve it
yeah, I want to add vk timing queries too
also memory use, since I have a custom allocator I can track all my memory
hmm in debug mode you could make the allocate function a macro that records the source info too
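A sketch of that allocation macro idea, assuming a simple bump allocator. `DemoArena`, `demo_arena_alloc_at` and `DEMO_ARENA_ALLOC` are hypothetical names, not the project's `Arena` / `arena_alloc`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base; /* backing memory, assumed 16-byte aligned */
    size_t cap;
    size_t used;
    /* Debug bookkeeping: where the most recent allocation came from. */
    const char *last_file;
    int last_line;
    size_t alloc_count;
} DemoArena;

/* Bump-allocate size bytes (rounded up to 16) and record the call site. */
static void *demo_arena_alloc_at(DemoArena *a, size_t size, const char *file, int line) {
    assert(a);
    size_t aligned = (size + 15u) & ~(size_t)15u;
    if (aligned < size || aligned > a->cap - a->used)
        return NULL; /* size overflowed or the arena is full */
    void *p = a->base + a->used;
    a->used += aligned;
    a->last_file = file;
    a->last_line = line;
    a->alloc_count++;
    return p;
}

#ifndef NDEBUG
/* Debug builds: the macro captures the caller's file and line. */
#define DEMO_ARENA_ALLOC(a, size) demo_arena_alloc_at((a), (size), __FILE__, __LINE__)
#else
#define DEMO_ARENA_ALLOC(a, size) demo_arena_alloc_at((a), (size), NULL, 0)
#endif
```

A real tracker would probably keep a table of (file, line, bytes) entries rather than only the last call site, which would be enough to show memory use per call site in a debug window.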
what is xxx_tr? xxx_trace? or xxx_tablerow? or xxx_transient?
did anyone tell the c people that we have enough disk space today for storing all the characters for our source codes 🙂
I have a character budget because I am a responsible adult
waste not want not
Also clearly it would be xxxxx_trow and xxxxx_trs
smh
Nah
I do this with subsystems, a major architectural piece in my code. It’s spammed everywhere. There’s not that many and they all have short versions of their names
There will only ever be one thing that gets to be called tr
The profiling code
I don’t want long varying names for these
I can spot a subsystem function easily. It’s easier to see the pattern working with my code
My internal functions are long and descriptive
These subsystem things all have a fat structure that is on the app context
actx->trctx->longthing.morelongerthings = bignamestuff;
you know what tr in trctx is because there’s only one tr
using StackWalker is going to be some really "works on my machine" level code
lol
this is a how to, not a library
well it is a C++ library also, but I'm not using it
To walk the callstack of another thread inside the same process, you need to suspend the target thread (so the callstack will not change during the stack-walking).
heh
no thanks
I think I don't need to walk it
I can just get the current stack
the current place
I believe you can also record the pc and then get just the current function (no call stack) from that
pc?
program counter
it's probably SymFromAddr
any game worth its salt should be adding -lDbgHelp as a dep tbh 
you know, if this game was for reals from scratch I wouldn't need that, I'd just write my own operating system on my own from scratch mined materials via a machine I built by hand using tools I also built
🪤
the silicon part is genuinely not doable but writing your own os is
do it it will be fun 
you probably need like 30 PhDs of knowledge to begin making your own computers from scratch
maybe fpga
uhh I beg to differ
that's a 4 bit calculator https://blog.lapinozz.com/learning/2016/11/19/calculator-with-caordboard-and-marbles.html
and a lot of mining permits
now run doom on it 
well I got a stackframe, now what, I guess I do that SymGetSymFromAddr64 thing
pretty cool
I also saw a video of someone doing a computer made with water
mechanical computers are cool
Was it this one? https://www.youtube.com/watch?v=IxXaizglscw
That is extremely cool
I should get SymGetLineFromAddr64 also
tracy calls SuspendThread
When a crash occurs, execution in the crashing thread is redirected to the handler that was set earlier. The handler lists all threads running in the program and one by one pauses their execution, leaving only two threads (footnote: there is actually a race, which can result in another thread starting executing, as suspending all threads is not an atomic operation) in a running state: the crashed thread, which is executing the crash handler and the profiler worker thread. This is done either by calling the SuspendThread() procedure on Windows, or sending the unused SIGPWR signal. During profiler setup another handler was installed for this signal, one that enters an infinite sleep loop.
only for crashes
nm
anyway
it works
void capture_stack_tr(Arena *arena) {
if (!arena)
fatal("no arena");
CONTEXT context;
RtlCaptureContext(&context);
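DEBUG_PRINT("capturing stack %llx\n", (unsigned long long)context.Rsp); // Rsp is 64-bit, %d would truncate it
STACKFRAME64 StackFrame = { 0 }; // zero the fields we don't set before StackWalk64 reads them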
StackFrame.AddrPC.Offset = context.Rip;
StackFrame.AddrPC.Mode = AddrModeFlat;
StackFrame.AddrFrame.Offset = context.Rsp;
StackFrame.AddrFrame.Mode = AddrModeFlat;
StackFrame.AddrStack.Offset = context.Rsp;
StackFrame.AddrStack.Mode = AddrModeFlat;
if (!StackWalk64(IMAGE_FILE_MACHINE_AMD64,
GetCurrentProcess(),
GetCurrentThread(),
&StackFrame,
&context,
NULL,
SymFunctionTableAccess64,
SymGetModuleBase64,
NULL)) {
fatal("stackwalk failed");
}
DWORD64 dwaddress = StackFrame.AddrPC.Offset;
DWORD64 dwDisplacement = 0;
// Name is a trailing array, so allocate the struct plus room for a 1024-byte name.
const size_t symSize = sizeof(IMAGEHLP_SYMBOL64) + 1024;
IMAGEHLP_SYMBOL64 *pSym = arena_alloc(arena, symSize);
if (!pSym)
fatal("OOM");
memset(pSym, 0, symSize);
pSym->SizeOfStruct = sizeof(IMAGEHLP_SYMBOL64); // Size is the symbol's size, SizeOfStruct is what DbgHelp expects here
pSym->MaxNameLength = 1024;
if (!SymGetSymFromAddr64(GetCurrentProcess(), dwaddress, &dwDisplacement, pSym)) {
DEBUG_PRINT("nope %lu\n", GetLastError());
} else {
DEBUG_PRINT("yup %s\n", pSym->Name);
}
}
yup capture_stack_tr
void capture_stack_tr(Arena *arena) {
now I guess I just yolo capture the stack from another thread lol
I think that might end up with use after free? 😨
I think this is UB tbh
hrm all I need is the address
I'll just try it
tracy doesn't use RtlCaptureContext
wtf is it doing
so just have to follow where it gets the address from
cheating!
tracy uses libunwind
huh what debugger is that
it's the rad debugger
it's just windows atm I think
it's called rad because it's made by the rad games tool team inside Epic
idk how you get a job where you can just make some open source win32 debugger at a company like Epic
All DbgHelp functions, such as this one, are single threaded. Therefore, calls from more than one thread to this function will likely result in unexpected behavior or memory corruption. To avoid this, you must synchronize all concurrent calls from more than one thread to this function.
well that's fine
A handle to the thread for which the stack trace is generated. If the caller supplies a valid callback pointer for the ReadMemoryRoutine parameter, then this value does not have to be a valid thread handle. It can be a token that is unique and consistently the same for all calls to the StackWalk64 function.
well
You cannot get a valid context for a running thread. Use the SuspendThread function to suspend the thread before calling GetThreadContext.
I don't think this is possible
with windows apis
fuck all this
it was cool getting the current processes stack though
I'm going to write a macro to just use my existing profiling and wrap function calls I think are slow
I can track nested profiling since it maintains a state in the trace context
I learned a lot of stuff
event tracing is probably the thing to use for this at some point
I'm going to add profiling info to my render graph so it records how long each part of the graph took
even with event tracing I'd have to instrument my code
it all makes sense
cool
ok, I'm just getting data still
I'll think about solutions later
really happy with my render graph tbh, it was easy to hook perf onto it
this is my render graph code now:
void render_gfx(AppContext *actx, Arena *arena) {
if (!arena)
fatal("render_gfx: null arena");
if (!actx)
fatal("render_gfx: null actx");
if (!actx->gctx)
fatal("render_gfx: null gctx");
if (actx->gctx->minimized)
return;
start_render_graph_tr(actx, arena);
render_node_t *current_node = actx->gctx->render_graph;
while (true) {
render_node_t *next_node = pump_graph(actx, arena, current_node);
add_render_graph_tr(actx, arena, (u32)current_node->render_node_type, current_node->name);
if (!next_node)
break;
current_node = next_node;
}
end_render_graph_tr(actx, arena);
return;
}
@elfin cape is correct, filling that full bitmap takes no time at all, it's all other stuff
ok next thing I have to build is getting the average time spent in a function and the full time spent in a function per frame
this is all cpu bound
pretty sure?
I don't know actually
this is great, I was going to do a thing where I don't draw the background in areas where I drew the window, and now I know that wouldn't help much at all, I'd be saving in the least expensive part of the rasterization
that app perf window is hard to read I should nest the zones
I actually need a way to do that for any zone, find the average time in the zone, the total time spent in the zone per frame
now need to do the average & total time per frame thing for each zone
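The per-zone bookkeeping for "average time in the zone" and "total time in the zone per frame" can be tiny. A sketch with hypothetical names, assuming each `end_tr`-style call hands over its duration in seconds and the stats are reset at the start of every frame:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    double total_s; /* total time spent in this zone during the current frame */
    uint32_t calls; /* how many times the zone ran this frame */
} ZoneFrameStats;

/* Call once at the start of each frame. */
static void zone_stats_reset(ZoneFrameStats *z) {
    assert(z);
    z->total_s = 0.0;
    z->calls = 0;
}

/* Call from the end-of-zone hook with that run's duration. */
static void zone_stats_add(ZoneFrameStats *z, double duration_s) {
    assert(z);
    z->total_s += duration_s;
    z->calls++;
}

/* Average time per run this frame; 0 if the zone never ran. */
static double zone_stats_average(const ZoneFrameStats *z) {
    assert(z);
    return z->calls ? z->total_s / (double)z->calls : 0.0;
}
```

For a zone inside a hot loop, the total is usually the number that matters, since a tiny average times hundreds of thousands of calls can still dominate the frame.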
I was thinking of just using the mapped host visible vk buffer for the bitmap instead of a cpu bitmap to avoid a copy
I still have to get more data, but I should cache the characters also
Are you using a scanline algorithm for rasterizing a triangle?
it's just a really dumb bounding box for the triangle, then do a barycentric coordinate check for each pixel, I'm going to get to that
I knew that would be slow
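For anyone following along, the "dumb bounding box plus barycentric check" approach looks roughly like this. It is a standalone sketch with an 8-bit bitmap and hypothetical names (`Vec2`, `edge`, `raster_triangle_bbox`), not the project's `pixel_to_sc` / `barycentric_coords` code:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float x, y; } Vec2;

/* Signed area of the parallelogram spanned by (b - a) and (p - a). */
static float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

static float min3f(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
static float max3f(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }
static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Writes value into every pixel whose center is inside the triangle and
   returns how many pixels were covered. bitmap is w * h bytes. */
static uint32_t raster_triangle_bbox(uint8_t *bitmap, int w, int h,
                                     Vec2 v0, Vec2 v1, Vec2 v2, uint8_t value) {
    assert(bitmap && w > 0 && h > 0);
    float area = edge(v0, v1, v2);
    if (area == 0.0f)
        return 0; /* degenerate triangle */
    /* Bounding box of the triangle, clamped to the bitmap. */
    int x_min = clampi((int)min3f(v0.x, v1.x, v2.x), 0, w);
    int x_max = clampi((int)max3f(v0.x, v1.x, v2.x) + 1, 0, w);
    int y_min = clampi((int)min3f(v0.y, v1.y, v2.y), 0, h);
    int y_max = clampi((int)max3f(v0.y, v1.y, v2.y) + 1, 0, h);
    uint32_t covered = 0;
    for (int y = y_min; y < y_max; y++) {
        for (int x = x_min; x < x_max; x++) {
            Vec2 p = { (float)x + 0.5f, (float)y + 0.5f };
            /* Barycentric weights; dividing by area handles both windings. */
            float b0 = edge(v1, v2, p) / area;
            float b1 = edge(v2, v0, p) / area;
            float b2 = edge(v0, v1, p) / area;
            if (b0 >= 0.0f && b1 >= 0.0f && b2 >= 0.0f) {
                bitmap[y * w + x] = value;
                covered++;
            }
        }
    }
    return covered;
}
```

Roughly half the pixels in the bounding box fail the test for a typical triangle, which is why skipping empty regions is worth looking at.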
just checking 
oh the render gfx should be nested under run app
once I have the data for per frame averages/total time of the slow stuff that runs in a loop I can start finally working on actually improving the perf
I might dump clang language extension vectors and matrices, they are a bit wonky I think
as in
they create weird padding and alignment issues on structs?
it's very strange
but writing vecmath in c without them is 
what are you using?
I don't really care about operator overloading
sure it's more readable to use operators, it's not the end of the world
i'm writing cipipi anyway but i'm using apple's simd math lib which i'm fairly sure relies on that clang extension
ah
could just be a me issue
I guess I could look at compiler explorer the next time I run into a problem and try to understand it better
ok
it's just for loops on the rasterization I think
the total time inside the loop is far less time than just the work of iterating the loop
unless I have a bug in this logic, which I don't think so
start_tr(actx, arena, actx->crctx->traces[18]); // x loop start
{
for (i32 x = x_min; x < x_max; x++) {
start_tr(actx, arena, actx->crctx->traces[21]); // bb col start
{
float4 pos = pixel_to_sc(actx->crctx->bitmap_width, actx->crctx->bitmap_height, (f32)x, (f32)y);
start_tr(actx, arena, actx->crctx->traces[23]);
float4 bc = barycentric_coords(pos, t1);
end_tr(actx, arena, actx->crctx->traces[23]);
if (bc.x >= 0.f && bc.y >= 0.f && bc.z >= 0.f) {
rgba_to_rgba16(bc_mix(t1.colors, bc), pixel);
}
pixel++;
}
end_tr(actx, arena, actx->crctx->traces[21]); // bb col end
}
}
end_tr(actx, arena, actx->crctx->traces[18]); // x loop end
it's just that for loop itself
for (i32 x = x_min; x < x_max; x++) {
idk
nesting these traces is really error prone
I need a better way
anyway
I think I got it
enough to go on
the more traces I showed in the UI the slower it got lol 
now numbers just have to go up
or down
got it a little bit faster
for (i32 x = x_min; x < x_max; x += 6) {
+ f32 x_f = (f32)x;
start_tr(actx, arena, actx->crctx->traces[20]); // bb col start
- {
- float4 pos = pixel_to_sc(actx->crctx->bitmap_width, actx->crctx->bitmap_height, (f32)x, (f32)y);
- start_tr(actx, arena, actx->crctx->traces[23]);
- float4 bc = barycentric_coords(pos, t1);
- end_tr(actx, arena, actx->crctx->traces[23]);
- if (bc.x >= 0.f && bc.y >= 0.f && bc.z >= 0.f) {
- rgba_to_rgba16(bc_mix(t1.colors, bc), pixel);
- }
- pixel++;
- }
+ cr_test_triangle_pix(actx, arena, pixel, x_max_f, x_f, y_f);
+ cr_test_triangle_pix(actx, arena, pixel + 1, x_max_f, x_f + 1.f, y_f);
+ cr_test_triangle_pix(actx, arena, pixel + 2, x_max_f, x_f + 2.f, y_f);
+ cr_test_triangle_pix(actx, arena, pixel + 3, x_max_f, x_f + 3.f, y_f);
+ cr_test_triangle_pix(actx, arena, pixel + 4, x_max_f, x_f + 4.f, y_f);
+ cr_test_triangle_pix(actx, arena, pixel + 5, x_max_f, x_f + 5.f, y_f);
+ pixel += 6;
end_tr(actx, arena, actx->crctx->traces[20]); // bb col end
}
adding more doesn't really make it much faster
I'll add something to avoid areas that I know won't have any part of the triangle in it
that should reduce things
I can do these concurrently too
and use simd I guess
I'm going to look at my text rendering next that's also slow af
for the same reason
hrm I just need to do less here with the characters, by caching them I think
I shaved like 5 ms by doing the same thing to chars, only that's more complex and breaks character rendering
I'm going to add character caching
it's a whole lot of just do less
is your renderer multithreaded yet?
that's an easy win if it's not
well, "easy", you know what i mean 
I don't want to do that yet
I think there's a way to get this to be faster just single threaded
one thing is to just reduce the scale, I probably shouldn't be going 4k with a cpu software rasterizer
ye 
yeah, fair enough
mt will add complexity so getting it to be as fast as possible single threaded is a good idea
but keep in mind some optimizations may be very well suited to a multithreaded renderer
yes, I plan on getting there
btw a super high res is probably good for testing perf
any changes will be far more obvious
your idea about avoiding empty areas of the tri is pretty solid, you could look into dividing the tri into tiles or something like that
just thinking out loud but maybe a lil quadtree (2 or 3 levels depending on the size of the tri, no more) could speed things up significantly
yes
I'm going to get rid of the ability to scale a window, it looks dumb and adds a lot of complexity
I can't really add the tracing code I have into the hot loops, the profiling itself distorts the profiling
well
I did learn something from it
doing much better already
haven't even done the character caching yet
was at 28 fps before with this amount of text
with less text actually, and less profiling, so it's faster now with more text and more profiling
yeah I am going to cache the characters next
the next thing I'll do is try to use the host visible vk buffer for the bit map instead of a separate set of bytes
I'm hoping I can get 3ms per frame out of those changes
after that I'll add a slider ui widget
and then I'll use that to dynamically adjust the scale to find a good perf for this window size
if I shrink the window to this
I get 60 fps
hrm
I think the windows capture drops the frames a little
obs also drops the frames a little
after font caching, a good scale, I'll add multithreading
font caching + vk buffer, scale and multithreading
and the multithreading will I think be generating frames in the background, thinking I'll have a fif type deal
once I load more triangles I'll probably have to resort to simd and 32 byte aligned operations
from:
to
rendering even more text
my windows and ui basically don't cost anything anymore
before if I added text and made the windows bigger the fps would drop massively
tomorrow I'll try and use the vk host visible buffer as the bitmap
and see if that makes things faster
then I'll add multithreading and go back to working on adding some ui features and start actually doing some 3D maybe, finally
what was the thing that was slowing down the text so much? 9ms -> 0.6ms speed up is really nice
I had an unnecessary bounds check, and I removed the scaling code, and then I added font caching
I reduced the size of the function that rendered text quite a bit
about that, I read a really long time ago that it would be slower, but it would be nice to benchmark it
as long as it's faster than copying the full bitmap to the vk buffer it should be a win I hope
it's costing me a full ms

that's not even the submit,
I think the submit must be part of the wait? I don't know
ngl I would do normal hw raster 
oh you mean just use the win32 apis?
render triangles using vulkan 
oh okay
I have a gpu software raster pipeline set up
and a graphics pipeline
and a rt pipeline
I'm going to just work on them all
when I get bored with one I work on the other
oh another thing I did to make ui render faster is just draw the windows bg all at once
and I write the first row to memory and then just copy it to the rest of the rows
that made things go faster too, I had all this complex, and buggy, logic with padding and nonsense and I just removed it all and just draw a square, and then render font on top of it, instead of what I was doing before where I drew some part of the window, then some font, then the rest of the line, etc
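The "write the first row, then copy it to the rest of the rows" trick, as a sketch. The 32-bit pixel format and the function name are assumptions for illustration, and the caller is assumed to have already clipped the rectangle to the bitmap:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill a w x h rectangle at (x, y) in a bitmap that is stride_px pixels wide.
   Only the first row is written pixel by pixel; every other row is one memcpy. */
static void fill_rect_rows(uint32_t *bitmap, int stride_px,
                           int x, int y, int w, int h, uint32_t color) {
    assert(bitmap && stride_px > 0 && x >= 0 && y >= 0 && w > 0 && h > 0);
    assert(x + w <= stride_px);
    uint32_t *first = bitmap + (size_t)y * (size_t)stride_px + (size_t)x;
    for (int i = 0; i < w; i++)
        first[i] = color;
    for (int row = 1; row < h; row++)
        memcpy(first + (size_t)row * (size_t)stride_px, first, (size_t)w * sizeof *first);
}
```

Drawing the whole window background this way and then rendering glyphs on top keeps the inner work to one simple loop plus bulk copies.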
I think I need the concept of a pixel
right now I am doing a serial process of here's my current pixel, let me convert it to coordinates, now let me do barycentric coordinates, does it pass or fail? mix a color, write it to bitmap, move on to next pixel
what I think I should do is actually uh
break all that up into layers
get a bunch of pixels all together at once
get their coordinates
get their barycentric coordinates
do the check
mix a color
write to bitmap
then I can do simd I think
the barycentric coordinate math would be a bunch of instructions based on how that's currently written
as would the mix
so I have to rewrite those
I think I should do this
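A sketch of the "layers" idea: take a batch of pixels from one row and run each stage over the whole batch before moving to the next stage. This is plain C with no intrinsics, the names (`Tri2`, `BATCH`, `shade_batch`) are hypothetical, and it assumes the triangle is wound so that its edge functions are non-negative inside:

```c
#include <assert.h>
#include <stdint.h>

#define BATCH 8

typedef struct { float x0, y0, x1, y1, x2, y2; } Tri2;

/* Process BATCH pixels of row y starting at column x, one stage at a time.
   out points at those BATCH pixels. Returns the number of pixels written. */
static uint32_t shade_batch(const Tri2 *t, int x, int y, uint8_t *out, uint8_t value) {
    assert(t && out);
    float px[BATCH], w0[BATCH], w1[BATCH], w2[BATCH];
    uint8_t inside[BATCH];
    float py = (float)y + 0.5f;

    /* Stage 1: pixel centers. */
    for (int i = 0; i < BATCH; i++)
        px[i] = (float)(x + i) + 0.5f;

    /* Stage 2: edge functions (unnormalized barycentric weights). */
    for (int i = 0; i < BATCH; i++) {
        w0[i] = (t->x2 - t->x1) * (py - t->y1) - (t->y2 - t->y1) * (px[i] - t->x1);
        w1[i] = (t->x0 - t->x2) * (py - t->y2) - (t->y0 - t->y2) * (px[i] - t->x2);
        w2[i] = (t->x1 - t->x0) * (py - t->y0) - (t->y1 - t->y0) * (px[i] - t->x0);
    }

    /* Stage 3: coverage mask. */
    for (int i = 0; i < BATCH; i++)
        inside[i] = (uint8_t)(w0[i] >= 0.0f && w1[i] >= 0.0f && w2[i] >= 0.0f);

    /* Stage 4: write covered pixels. */
    uint32_t written = 0;
    for (int i = 0; i < BATCH; i++) {
        if (inside[i]) {
            out[i] = value;
            written++;
        }
    }
    return written;
}
```

Each stage is a short loop over arrays with no branches until the final write, which is the shape compilers can often auto-vectorize and the shape hand-written simd would take anyway. Color mixing would be one more stage between the mask and the write.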
I am going to test the vk buffer host visible mapping, maybe that's slow? I don't know
I'm also looking into hardware counters for profiling
specifically for my hardware
QueryPerformanceCounter/Frequency
I have that already
it's slow af
I need to sample
I think
I need to figure out how to do them periodically
I am currently looking at model specific registers for intel though
downloading the intel docs to read on my ipad https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
apparently my hardware sucks, Core i5-10400, I want all the new things
the AMX stuff 
let me try this crazy vulkan buffer idea
maybe it is slow
I think I need to actually make a render graph node for the cpu rasterizer for this to work
since I have a per fif staging buffer
honestly doing the UI and cpu raster in a render graph makes a lot of sense 
it is rendering
it'll only be slow if the memory is device local and you have a dGPU
also, what GPUs do is hierarchically rasterize blocks of pixels
so you start by testing all the large squares of pixels the triangle's AABB touches, then narrow it down a few times
at the lowest level it might be 8x8 or something
which you can SIMD-ify pretty well
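One level of that hierarchical test, as a sketch: classify a square tile against the triangle's three edges, then only do per-pixel work in tiles that are partially covered. Names are hypothetical, and the triangle is assumed to be wound so its edge functions are non-negative inside:

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;
typedef enum { TILE_OUTSIDE, TILE_PARTIAL, TILE_INSIDE } TileClass;

/* Signed area of the parallelogram spanned by (b - a) and (p - a). */
static float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* Classify the square tile with top-left corner (tx, ty) and side size.
   The test is conservative: a tile reported as PARTIAL may still turn out
   to contain no covered pixels. */
static TileClass classify_tile(Vec2 v0, Vec2 v1, Vec2 v2, float tx, float ty, float size) {
    Vec2 corners[4] = {
        { tx, ty }, { tx + size, ty }, { tx, ty + size }, { tx + size, ty + size }
    };
    Vec2 a[3] = { v1, v2, v0 };
    Vec2 b[3] = { v2, v0, v1 };
    int fully_inside = 1;
    for (int e = 0; e < 3; e++) {
        int negative = 0;
        for (int c = 0; c < 4; c++) {
            if (edge(a[e], b[e], corners[c]) < 0.0f)
                negative++;
        }
        if (negative == 4)
            return TILE_OUTSIDE; /* every corner is on the wrong side of this edge */
        if (negative > 0)
            fully_inside = 0;
    }
    return fully_inside ? TILE_INSIDE : TILE_PARTIAL;
}
```

Tiles classified as OUTSIDE are skipped, tiles classified as INSIDE can be filled without any per-pixel test, and PARTIAL tiles are either split again or handed to the per-pixel loop once they reach something like 8x8.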
nice
well I guess the actual lowest level is 2x2 since those are quads of invocations. you might be interested in that for derivative calculations
you can even write your own "shaders" that are just callbacks
well a shader I can write as a sequential set of operations per pixel, and the gpu driver actually parallelizes all the instructions for me, yes?
but on the cpu side I have to do this myself I think?
hrm
I see what you mean I think
yeah auto parallelization of shaders would be super hard
I'm not sure what derivative calculations you are referring to
unless you write a vm or something
the ones for dFdx and dFdy that are implicitly used when you call texture functions with implicit lod sampling
right right
I want to use those for tangent generation also
I just vaguely know about it, I have to read through it to understand the actual math
the math is pretty
le
nice
yes that's right I have that in my notes from when we talked about it before
when I was looking at spirv stuff
yeah so your gpu code is automagically SIMD because the compiler needs to do that anyway
but on the cpu it's tricky because your "shader" won't run in blocks of 2x2 threads in lockstep
that's why I suggested a vm earlier
yeah, I'm just going to avoid trying to replicate shaders where I work on one pixel at a time, and do CPU specific work in blocks I think
if the work unit is 2x2 then you can hand-write simd shaders if you want, or just do a loop over the pixels
that would be a big perf win I think 2x2 simd
and not too hard to think about
as in the code won't be incredibly complex
I'm going to try that thanks
the call stack from another thread was a big no go btw
since you have to suspend threads for that to work
it was windows
yes I am curious about your profiler code also
I have my school projects scattered around my drives so I need to search
suspending a thread is horrific, if it is holding on to sync primitives for resources shared with other threads
I'm pretty sure I didn't do that
considering it was sampling I think thousands of times per second
which is also the frequency of the normal vs profiler
huh I can't find the course files anywhere on my pc
was this for an undergraduate class?
yeah
feels like maybe you had unintentionally produced phd dissertation level work
maybe it's on ur github?
classwork on github is sus
I checked that too
well there actually was a github organization that I was in for the class, but I left it
I thought I'd have a repo though 
there's like some academic integrity problem with doing that imo
not if it's private 🤫
just my opinion, but there may be uni policy also
we were supposed to join the org for the class
and it was private
somehow the way we uploaded our projects was private to other students
maybe the class was just ok with the UB of doing it without suspending the thread
lol maybe
russian roulette profiling
btw I got that 11" ipad air with a pencil and I love it
wow pretty interesting
nice 
have you guys ever seen The Deer Hunter? great movie, except for the weird Russian roulette part
I like with the pencil how I can hover my pencil over a link and click the pencil to follow links
I wonder if all my files for that project were on a school drive
huh so my quadtree idea wasn't too far off
ooh nice, m3?
yeah it's just a fatter tree
like more subdivisions per level?
cringe
all my uni assignments are on gh and open source 
unless the course specifically disallows it
I gave up looking for my projects for that class 
thanks for looking
I'm not sure what it has, I think m3
the latest air is m3 so if you bought new ye probably
mine is m1
honestly both are incredibly overkill for what most people do with an ipad 
yeah
actually needs the m3 maybe for that? these pdfs are huge lol
you know you have rt hardware in that thing right?
yeah but I don't plan to write apps for gross locked down sandboxed mobile devices
based
when people tell me they're writing for android or browsers I just feel so sad about it
browsers are fine tbh
or if they're talking about it in one of the channels
not the same as native dev but
idk, not for me
it's not a walled garden ecosystem like mobile
it's still kinda shit imo but I understand people can find that interesting
the web's actually kinda nice mostly because of how easy it is to ship something and share it with people
I hate it so much lol sorry
I think peak browsers was 1990's HTML
I like being able to see images and videos like YT I guess
I wish I could just use lynx to browse the web
if u have adblock web is sooo much better
yup of course
this is not a platform issue, it's a content issue (websites are garbage)
the web platform itself is pretty cool
yeah
if you want to build nice ui quick there's nothing better than html+css imo
same
I mean for work it's fine
just in my personal time I don't want to do anything with it
@astral hinge maybe you used https://learn.microsoft.com/en-us/windows/win32/debug/capturestackbacktrace it captures the current thread though
I used to love web dev :\
I think work killed it in my soul
it's also so much work now to build a website, and so expensive. A k8s cluster cost hundreds of dollars per month
if you just want some static website, sure
you could build a server less website using aws and step functions
yeah you could just run a LAMP stack or Django or whatever, heh
in 2025 👀
basic html + crappy used pc + home wifimaxxing
:|
I don't run servers in my house
I guess App Engine or whatever is the thing to use maybe
iirc snapchat was on App Engine for a very long time
I have a little nuc I got off ebay but I just use it to host static files to stream to my friends mainly + a matrix server
I think the last time I was excited about webdev was when Meteor was a thing
I used to go to the Meteor HQ for meetups every month
then react won
I don't think I have whatever it takes to be excited about web frameworks
yeah it's jover for me too
or tech in general
Meteor was so much fun
I've noticed the only people still excited about tech are the ones that know the least about it
are you not including graphics programming as tech?
all I want any more involves a C compiler
just moving my stuff to the render graph improved perf lol
frame time dropped by like 3ms and got like 5 more fps out of it
what it was before
gonna try passing in the staging buffer now
and removing the copy
I don't understand why that improved performance
how do computers work
I assume your frame graph is smarter and emits better barriers and whatnot?
it doesn't emit any for any of this though
it does for vk transitions that need them
lol using the staging buffer works but it is dramatically slower
haha
nah
I exchanged a 1ms copy for an extra 10ms to raster
yeah vercel would be great actually
maybe I can configure this buffer differently
const VkBufferUsageFlagBits usage_flags = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
const VkMemoryPropertyFlags mem_properties
= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
VK_MEMORY_PROPERTY_HOST_CACHED_BIT bit specifies that memory allocated with this type is cached
on the host. Host memory accesses to uncached memory are slower than to cached memory,
however uncached memory is always host coherent.
that also increased the copy to draw image time and both of the submit times by 1ms
it's nuts
I don't see any way to do this and it not be slow
but
I removed VK_MEMORY_PROPERTY_HOST_COHERENT_BIT since I don't need that, and went back to using my bitmap. I didn't get a win here either though
I did get a huge win from just shuffling my code around?
ok
I'm going to do the simd 2x2 quad stuff jaker brought up
I am regularly seeing 45 fps now
23ms frames
we are doing science in this thread
oh
also I had a bug with cpu raster and resize, and that's now fixed too by moving everything into the render graph
my render graph is fucking magical
it makes things faster and fixes bugs for me
emergent render graph behavior
kinda hard to reliably get timings on light computation loads anyway because all the crap happening in your OS is pure noise to the timings
true
I am pretty certain that this change made it consistently faster
I think simd, MT and reduced scale and I'll be unblocked by performance issues on the CPU software raster progress for a bit
lol
you know what
this whole week
I have been running with verbose validation enabled 

still going to do the simd idea though
and multithreading
I'm so close to finally doing 3D graphics
oh reading intel's manual I realized I should be using fused multiply add for my dot product, all kinds of things, I could be using shuffle to set a whole bunch of color values at once based on barycentric coordinates, man
but even without avx I should be using fma
for better precision
as it rounds less
I like reading the manual on my ipad on the bus, I also am reading the tinyrenderer lectures for software rasterization
I am pretty happy with the tech on this project. I think I have made major strides since Rosy
soon I will have nicer pixels
I'm serious about ripping out shaders and generating SPIRV with application code. I think in my UI I will make a SPIRV disassembly viewer for debugging. That's a ways off though, I want to get some 3D and lighting going in the CPU software renderer
also an IR view
maybe an IR view and I can click on or hover over the IR and see the SPIRV dis it will generate
I like how I added a string to my render graph nodes to document its intent when debugging I will do that with the IR too
@cloud rivet I found the code with the help of a former classmate of mine. sadly it looks like it just suspends the thread and resumes it for every sample
somehow it gets 1000Hz resolution though, so maybe suspension isn't that expensive
here's an example of the output it'd generate when you call GenOutput()
thank you! someone on another discord suggested using core affinity for the thread I want to sample and then another thread on the same core that wakes up periodically, which guarantees the thread is suspended
suspending a thread is very shitty if you have a mutex for example
in that thread and other threads
maybe things can deadlock
hmm I can see how it could slow down other threads, but not how it could cause a deadlock
I don't fully understand this
does that mean checking the state of a thread and only sampling the program counter when it's suspended?
you'd also have to make sure it doesn't wake up while you're doing that, right?
numerous examples
http://blog.kalmbachnet.de/?postid=6
http://blog.kalmbachnet.de/?postid=16
http://blog.kalmbachnet.de/?postid=17
seems to be an issue with using CriticalSection apis and other things
oh
replied to the wrong message
I guess the idea is that if you're running on the same logical core the other thread can't also be running? I may have misunderstood what they said
I see yeah
I mean it should be ok if your profiler thread doesn't do anything that takes a lock though
maybe GetThreadContext acquires a lock
the recommendation is to run stackwalk in a critical section based on the docs 😅
it's a cool thing to have
I may try it thank you for the codes
I will try it once I have a checkbox
then I can conditionally turn that on
I wonder if the compiler already just optimizes some code to FMA
is it allowed if you don't have fast math?
it does not https://godbolt.org/z/ssYqdGqen
let me google this fast math thing
I saw in the intel manual actually
let me actually look at clang compiler args
-ffast-math
I don't see it there either
I don't understand
because fma can have different precision than doing each operation individually
128-bit Legacy SSE version: The first source and destination operands are the same. Bits (MAXVL-1:32) of the corresponding the destination register remain unchanged.
the registers it uses are SSE
MOVHLPS help
This instruction cannot be used for memory to register moves.
128-bit two-argument form:
Moves two packed single-precision floating-point values from the high quadword of the second XMM argument (second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
128-bit and EVEX three-argument form
I'm going to compile with -ffast-math
hrm
running with similar frame time
using -ffast-math without optimizations just produces the same result as no args
Enable fast-math mode. This option lets the compiler make aggressive, potentially-lossy assumptions about floating-point math. These include:
Floating-point math obeys regular algebraic rules for real numbers (e.g. + and * are associative, x/y == x * (1/y), and (a + b) * c == a * c + b * c),
No NaN or infinite values will be operands or results of floating-point operations,
+0 and -0 may be treated as interchangeable.
-ffast-math also defines the __FAST_MATH__ preprocessor macro. Some math libraries recognize this macro and change their behavior. With the exception of -ffp-contract=fast, using any of the options below to disable any of the individual optimizations in -ffast-math will cause __FAST_MATH__ to no longer be set. -ffast-math enables -fcx-limited-range.
This option implies:
-fno-honor-infinities
-fno-honor-nans
-fapprox-func
-fno-math-errno
-ffinite-math-only
-fassociative-math
-freciprocal-math
-fno-signed-zeros
-fno-trapping-math
-fno-rounding-math
-ffp-contract=fast
Note: -ffast-math causes crtfastmath.o to be linked with code unless -shared or -mno-daz-ftz is present. See A note about crtfastmath.o for more details.
+ and * are associative
lol what
when is that not the case
oh maybe this is floating point spec stuff
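A small sketch of why that rule is not free for floats, assuming IEEE single precision and a build without -ffast-math. The function names are made up for illustration. Regrouping the same three values changes the result because each intermediate sum is rounded.

```c
#include <assert.h>

/* (a + b) + c */
static float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

/* a + (b + c) */
static float sum_right(float a, float b, float c)
{
    return a + (b + c);
}

static void associativity_demo(void)
{
    /* 1e20f + -1e20f cancels exactly, then adding 1 gives 1. */
    assert(sum_left(1e20f, -1e20f, 1.0f) == 1.0f);
    /* -1e20f + 1 rounds back to -1e20f because 1 is far below
       one unit in the last place at that magnitude, so the total is 0. */
    assert(sum_right(1e20f, -1e20f, 1.0f) == 0.0f);
}
```

Under -fassociative-math the compiler is allowed to pick either grouping, which is why the docs list associativity as an assumption rather than a fact.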
-ffast-math also defines the __FAST_MATH__ preprocessor macro. Some math libraries recognize this macro and change their behavior.
man
I am learning so much
the std library checks this
if the macro constants FP_FAST_FMA, FP_FAST_FMAF, or FP_FAST_FMAL are defined, the function std::fma evaluates faster (in addition to being more precise) than the expression x * y + z for double, float, and long double arguments, respectively. If defined, these macros evaluate to integer 1.
well
it checks other things
why am I linking to cpp
because your subconscious yearns for longer compile times
this is with fma
CVTSS2SD help
Converts a single-precision floating-point value in the “convert-from” source operand to a double-precision floating-point value in the destination operand. When the “convert-from” source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register. The result is stored in the low quadword of the destination operand.
128-bit Legacy SSE version: The “convert-from” source operand (the second operand) is an XMM register or memory location. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged. The destination operand is an XMM register.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers. Bits (127:64) of the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
Software should ensure VCVTSS2SD is encoded with VEX.L=0. Encoding VCVTSS2SD with VEX.L=1 may encounter unpredictable behavior across different processor generations.
call fma@PLT does actually not appear in the -ffast-math version
if I am going just by counting the number of instructions
to determine speed
not using fma would be faster
but
it's probably not that simple
fma is more about avoiding loss of precision I think?
so fma is orthogonal to ffast-math ?
k anyway
it isn't
x86 instructions don't tell you a lot more than c code
who knows what the cpu is actually doing at the hw level
stupid out of order execution
and it probably changes for every uarch
jk, that makes things go faster
just write code for a 386 
I was reading the intel manual today and the first chapter is about the history of intel processors
so it had the 286/386/486 etc listed and what the innovations were for each architecture
it was fun to read about
huh
I read like 5 chapters of that thing
idk when ooo was introduced
it tells you in there
iirc 386 didn't have it
pentium something?
it's like in the aughts
oh
P6
1995
I was off
The centerpiece of the P6 processor microarchitecture is an out-of-order execution mechanism called dynamic execution.
hrm
converting all my raster math to operate on float4[4]'s
will start with just hand doing what I already do
just operating on 4 float4s at a time
then change the raster code to use that
and then figure out how to do that with simd
that way by the time I get to simd I know the rest of it works and it's just getting the simd to work
you shouldn't use it anyway because it breaks a bunch of stuff
excellent question
it's probably only useful when applied at the function level
I get what it's breaking sort of with floating point implementation
applying it to the whole application is just asking for trouble
not with unity builds lol
maybe there are pragmas that enable/disable scopes of fast math
which compiler are you using?
clang 20
for the vector language extensions
and because idk I can go read the source code for it I guess
clang has #pragma float_control
look for it in here
https://clang.llvm.org/docs/LanguageExtensions.html
When pragma float_control(precise, on) is enabled, the section of code governed by the pragma uses precise floating point semantics, effectively -ffast-math is disabled and -ffp-contract=on (fused multiply add) is enabled. This pragma enables -fmath-errno.
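A rough sketch of scoping this to one function, assuming clang. The pragma is the one quoted above, placed at the start of a compound statement. The `__clang__` guard is my addition so other compilers skip it, and `lerp_precise` is a hypothetical helper, not something from the project.

```c
#include <assert.h>

/* Linear interpolation that keeps precise floating-point semantics
   even if the rest of the file is built with -ffast-math. */
static float lerp_precise(float a, float b, float t)
{
#if defined(__clang__)
    #pragma float_control(precise, on)
#endif
    assert(t >= 0.0f && t <= 1.0f);
    return a + (b - a) * t;
}
```

The other direction (fast math only inside one hot function) would use the same pragma with `off`, but that is worth measuring before relying on it.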
ah
so disable it by default
hrm
interesting
I don't want fast math
I hate it
I didn't know it existed,but now I do, and I regret it
jk
I'm sure it has its uses whatever those are
big brain stuff
rewriting things to be float4[4]s is forcing me to fix my horrific raster code
which was just sort of written to work as I figured it out
so that's good
as long as I don't make anything slower
i was writing this garbage when I was recording my 1 hour videos
My guess is it’s legacy hardware oriented optimizations, potentially?
why do you think that?
I don’t do much at a hardware level, but I’m pretty sure floating point math is pretty fast nowadays
I also don’t know exactly how fast math works, but I for some reason want to assume it’s doing something in software to avoid some check?
I think ffast-math is about taking shortcuts
ffast-math relaxes certain rules and allows the compiler to make more assumptions
and not old hardware imo
this work is hard
hrmm
might be faster to just rewrite the rasterizer
idk idk
I'm gonna keep going
why is this making my rasterizer even faster
I haven't even done any simd yet
I was like at 25 fps 2 days ago
and I'm at 80 now
my frame time dropped like 4ms from just migrating halfway to simd
what did you change?
I migrated my functions like this
float4 cross(float4 a, float4 b);
f32 dot(float4 a, float4 b);
void cross4x4(float4 a[4], float4 b[4], float4 out[4]);
void dot4x4(float4 a[4], float4 b[4], float4 *out);
float4 barycentric_coords(float4 pos, p_triangle t);
void barycentric_coords4x4(float4 pos[4], p_triangle t, float4 bc[4]);
but
I'm just passing in 1 thing
for each of those float4 [4]
I'm just setting the [0] value right now
simd for free
what's the 4x4 naming convention btw
for cross and dot it makes dense because there are four of each operand
sense*
but for barycentric there's still just one triangle
it's just like a marker that made sense with those and I kept using it
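A sketch of what the batched form of `dot` could look like, following the signatures listed above. Assumptions: `float4` here is a plain struct stand-in (the real one is presumably a clang vector type), and `dot` uses all four components. Each lane of `out` holds one of the four dot products, which is the layout that lets a compiler batch the work.

```c
#include <assert.h>

/* Stand-in for the project's float4; a struct keeps the sketch portable. */
typedef struct { float x, y, z, w; } float4;

/* Scalar form: one pair of vectors in, one value out. */
static float dot(float4 a, float4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Batched form: four independent dot products, one per lane of *out. */
static void dot4x4(const float4 a[4], const float4 b[4], float4 *out)
{
    assert(out != 0);
    out->x = dot(a[0], b[0]);
    out->y = dot(a[1], b[1]);
    out->z = dot(a[2], b[2]);
    out->w = dot(a[3], b[3]);
}
```

Filling only element [0] of each input, as described above, still works with this shape; the other three lanes just compute on whatever is in the unused slots.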
btw you can possibly get more perf by making the arguments restrict
hrm
I will put this in my notes, I don't understand what that page is talking about right now
thanks
I don't want to cargo cult it, I want to thoroughly understand what that means
looks super interesting
allows the compiler to optimize, but I have to read it carefully
putting restrict on a pointer means that a write through another pointer won't affect anything pointed to by the first
so it means the compiler can assume writes through some other pointer won't change the restrict data, allowing it to possibly emit fewer loads
what does writes through mean?
just dereferencing and storing *foo = 42;
so it's a promise that doing *foo = bar; where bar is a pointer won't change bar?
oh wait
I get it
no I don't get it
I'll illustrate with a function
void foo(int* a, int* b, int* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
}
so here the compiler must assume that a, b, and c can alias each other (point to overlapping regions of memory)
so when c[0] is written to, it must assume that a[0] and/or b[0] could have changed (because I use them twice)
if a and b are restrict then the compiler no longer has to assume that a write through c changes them
that makes sense, thanks!
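The same function written both ways, as a sketch. `mul_twice` and `mul_twice_restrict` are illustrative names. The first version is legal to call with overlapping pointers, and the result shows why the compiler has to reload after the first store.

```c
#include <assert.h>

/* Without restrict: c may alias a or b, so after the store to c[0]
   the compiler must reload a[0] and b[0] before computing c[1]. */
static void mul_twice(const int *a, const int *b, int *c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
}

/* With restrict the caller promises that c does not overlap a or b,
   so the product can be computed once and stored twice. */
static void mul_twice_restrict(const int *restrict a,
                               const int *restrict b,
                               int *restrict c)
{
    /* Only catches the exact-same-pointer case, not partial overlap. */
    assert(c != a && c != b);
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
}
```

Calling `mul_twice(x, x, x)` with x = {3, 0} leaves x = {9, 81}, because the first store changes the value both loads read. Doing the same with the restrict version would break the promise and is undefined behavior.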
back when I was learning zig I read through the entire spec all the time
and there are all these builtins
that are I guess kind of like qualifiers in C
and I encountered that I think
also you can use restrict on array parameters like this
void foo(int array[restrict 4]);
you can also put static inside the [] to assert that the pointer points to an array with at least 4 elements
this is a C only thing?
is this a new use of the word static?
yeah
that thing means so many things
In each function call to a function where an array parameter uses the keyword static between [ and ], the value of the actual parameter must be a valid pointer to the first element of an array with at least as many elements as specified by expression:
I think it's worth trying in your perf-sensitive math functions. it probably helps with autovectorization
actually it probably doesn't if you already have compile-time loop bounds hmm
idk what static could help with
oh it implicitly asserts the pointer is non-null
using [static 1] to document/assert non-null arguments is interesting
ugly syntax though
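A short sketch of both array-parameter forms (C only, not valid C++). The function names are made up. `[static 4]` promises at least 4 valid elements, `[restrict 4]` is the same as a restrict-qualified pointer, and `[static 1]` works as a non-null marker.

```c
#include <assert.h>

/* in: at least 4 readable ints; out: does not overlap in. */
static void scale4(const int in[static 4], int out[restrict 4], int k)
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * k;
}

/* [static 1] documents that p must point to at least one int. */
static int first(const int p[static 1])
{
    return p[0];
}

static void array_param_demo(void)
{
    int in[4] = {1, 2, 3, 4};
    int out[4] = {0, 0, 0, 0};
    scale4(in, out, 3);
    assert(out[0] == 3 && out[3] == 12);
    assert(first(in) == 1);
}
```

Neither qualifier is checked at runtime; passing a shorter array or a null pointer is undefined behavior, though some compilers warn when they can see it at the call site.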
ya
my debug build is getting slower and my release build is getting faster
actually
I can't really tell on the debug
because it's such a tiny number
it probably doesn't mean anything
it's been hovering around the same values tbh
lol

I haven't touched the triangle yet, that's just the background
_mm_storeu_si128 is so fast
I think I've been benefiting mostly accidentally from compiler optimizations as a result of changing my code and adding the compiler flags to enable simd
so after moving the rest of the rasterizer to the 4x4 SIMD, I have a plan for MT
I am thinking I will have 4 threads that each individually iterate over 1 of the quadrants of all 4x4 quads in a surface in NDC
and have them all run down through the same surface
what I like about this is there's no overlap in where each thread will write to
I'll work down the bounds of a triangle and do tests on the 4 corners, ie for a 8x8 quad to see whether to even bother with doing any work and each 4x4 will also do a test
I need to keep everything aligned, the bitmap itself is the size of the full screen no matter how big the window is so I think it is safe to overwrite
on arbitrary window sizes
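One way to read that plan, as a sketch only: split every 4x4 block into four 2x2 quadrants and give thread k quadrant k of every block. `owner_thread` is a hypothetical helper, and the pixel-to-quadrant mapping is my assumption about what "quadrant" means here. It shows why no two threads ever write the same pixel.

```c
#include <assert.h>

/* Which of the 4 worker threads owns pixel (x, y), assuming each
   4x4 block is split into four 2x2 quadrants. */
static int owner_thread(int x, int y)
{
    assert(x >= 0 && y >= 0);
    int qx = (x & 3) >> 1;   /* 0 = left half of the block, 1 = right half */
    int qy = (y & 3) >> 1;   /* 0 = top half of the block, 1 = bottom half */
    return qy * 2 + qx;
}
```

Every pixel maps to exactly one thread id in 0..3, and each thread gets 4 pixels per block, so the work is evenly split as long as the bitmap dimensions are multiples of 4.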
the in window UI windows are a bit trickier
this with simd? how are you using it?
just stick stuff into simd registers, do ops
idk
it's magic
after I'm done reading the intel manual I think I'm going to read the latest C spec, and after that maybe I'll read about clang, since the optimizations it is doing are so dramatically impactful I should learn more?
I'm kind of worried that by doing the simd myself I'll actually make shit slower
because maybe the compiler is better at autovectorizing than I am at figuring out what I should do when
I'll make a change and see how it goes
you can see my raster triangle frame time dropped in that screenshot even though I hadn't made any changes to it, same with the render text
like all I did I think was add the compiler flags
render text dropped from 200 microseconds to 40
that manual prints nicely as a pdf
oh there's a -fproc-stat-report so I can just have the compiler print how long it took instead of me trying to measure with a cli tool
i think you should get model loading before optimising it too much
to get more "real" work
I have to write a model importer lol
I will just write an obj importer to start with
I can’t support gltf until I write a json parser
you should watch the simdjson talks if you really want a rabbithole
I kind of need a png thing too
also I need a way to do matrix math
I just don't have anything yet
the pain of having to diy 
it's fun
I have 10k lines of single triangle rendering
oh my mesh shader has three triangles, that's right
nice
it's a triangle medley
this is a cool site https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
I don't understand that site at all
that's like benchmarking data?
I'm just trying to understand what ops to use 😅
that's pretty hard
it's architecture specific information about each instruction
since x86 instructions are basically high level and are implemented with micro ops
so they have different perf characteristics on different arches
I think I just don't know enough to make any sense out of the information there
so I guess that site is more like an alternative to Agner Fog's instruction tables
but I mean you can explore and read about different instructions without worrying about how they perform on different arches
same as the Intel intrinsics guide
well the Intel guide also has the perf info, but only for their arches
uops also has AMD arches
I see
you know what I did
I removed all my hand crafted simd
and everything got faster
average simd experience
when you're at the point you start looking at uarch opts you know you've gone off the deep end
and adding -mf16c
i mean yeah it's not a you problem compilers are very good at what they do
it's those two things I get 100fps with just -mf16c and then another 100fps with my reorg
I went back to my code before I reorganized it and just used -mf16c and it went from 70fps to over 100
with both I get 200-300fps
I am actually undoing more of this code to see if I can get it even faster, the simd's removed but I did some weird stuff to get simd
you know what I realized
I don't need to test barycentric coordinates for every pixel
just along the edges do I have to test it for every pixel
of the triangle
anyway
I agree with dodo
I think I should start like rendering actual things
I just accidentally keep falling into better perf
how do you detect the edges tho 
one thing i did find out working on my last sw rasterizer is you can calculate the barycentric coords incrementally and save a whole bunch of ops
didn't go much further than that tho cause i wrote the whole thing in like an afternoon
Yeah actually i need the barycentric coordinates for interpolation
So it’s more about outside the triangle I guess
You wrote a thing that works in an afternoon though, that’s amazing
Incremental is interesting
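A sketch of the incremental idea using the standard edge function, assuming integer pixel coordinates. `point2i`, `edge`, and `edge_row` are illustrative names. The edge value is linear in x and y, so after one full evaluation at the start of a row, each step to the right is a single add.

```c
#include <assert.h>

typedef struct { int x, y; } point2i;

/* Edge function: twice the signed area of triangle (a, b, p).
   Its sign says which side of edge a->b the point is on, and the three
   edge values of a triangle are its unnormalized barycentric weights. */
static int edge(point2i a, point2i b, int px, int py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

/* Fills out[0..w-1] with the edge value at pixels (x0 + i, y):
   one full evaluation, then a constant added per step. */
static void edge_row(point2i a, point2i b, int x0, int y, int w, int *out)
{
    assert(w >= 0);
    int e = edge(a, b, x0, y);
    int step_x = -(b.y - a.y);   /* change in edge() when px grows by 1 */
    for (int i = 0; i < w; ++i) {
        out[i] = e;
        e += step_x;
    }
}
```

Stepping down a row works the same way with a step of (b.x - a.x). A pixel is inside the triangle when all three edge values have the same sign, and dividing them by their sum gives the barycentric coordinates for interpolation.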
The lesson learned the last few days is write vector code in a way to enable the compiler to batch work without trying to tell it how
And to use F16C when using half types
well like two afternoons really
but yea
it barely works but hey it does work
@cloud rivet I forgot to respond to you about the rad linker. For my own project I haven't used it since it's okayish, but I would like to use it at work. The whole codebase and CI there is such a shit show that it's not possible.
I plan to use it in the next project thats in the works but I dont have that much free time...
nice
I don't link anything other than win32 and the vulkan dll so I haven't looked into it, and I run a unity build
I don't really have any linking problems to solve
if it's faster that's cool
I link quite a lot of things
https://github.com/lukasino1214/foundation/blob/master/vcpkg.json
is that used instead of making vk api calls directly?
yes
neat
this is my task code for render graph
https://github.com/lukasino1214/foundation/blob/master/src/graphics/virtual_geometry/tasks/draw_meshlets_only_depth_masked.inl
there is bunch of macros to share code between shaders and C++
yes but its expensive 😭
sol2 was the most expensive. Just removing it lowered the compile time from 20s to 10s
thats on my 7950X
10s compile is not horrible
it could be much better. I do some things that are not optimal. I hope the next project is going to be much better. We are using interfaces and forward declaring as much as possible
uh
my unity build has been going well so far
it's just a hobby project
not a real thing
You get benefit of C compile times
yeah
no template explosion, etc...
compiles under 1 second still
I honestly never want to write in anything but C at this point, although I do sometimes miss zig
I deal with that at work and its not fun at all. I deal with 3 hours compile time. I had to compile today twice...
my work's go monolith takes like 10 minutes to compile, but 3 hours jfc
the worst thing this isnt a monolith thats just one product...
I am really happy about getting rid of boost that should cut down the compile times down a lot.
what linker ?
rad linker
the reason fortnite is listed there is because they reached the limit of symbols in pdbs
does it work on loonix
on linux you have mold
i think rad debugger doesn't
it will
