#Performance
1 messages ยท Page 9 of 1
Gotta start somewhere, the test suite is too large to effectively debug with initially
as I go on I'll use that but first I start simple
(remember that initially zero test pass, then 1k, then 2k, etc. it's not test-by-test thankfully)
Good luck krkrkr
Somehow I'm surprised typst has only 2632 tests
There are a bunch of tests that should probably be multiple
But also there are not enough tests
(There never are)
I am pretty sure that things like inspection are not tested or have very few tests and that's one of the biggest chance with the VM, it works completely differently ๐
No tests means you can silently break stuff and no one will know ๐
i didn't know I added 500+ tests for the language server. Adding tests is a heavy work tbh. In that meaning, 3000+ tests is huge. But the giant repo stories like rust and llvm do have much more insane number of tests.
I know you're tired of working on the VM, but the VM Chronicles that you've been posting here, have really been better than Netflix. I'll miss it when vm3d is merged. Hopefully laurmadje reviews it for a good few episodes.
There'll be a 2 Virtual 4 Machine after
Or Virtual M4chine
Then Virtual Machine with the V stylized as a roman numeral
then VIrtual machine
As I am debugging I keep printing instructions to debug them, so I finally made them "pretty":
[ 0] AllocateArray { size_hint: 4 }
[ 1] ArrayPush { value: Int(1), span: Span(285158810540467) }
[ 2] ArrayPush { value: Int(2), span: Span(285609892233913) }
[ 3] ArrayPush { value: Int(3), span: Span(286060973927359) }
[ 4] ArrayPush { value: Int(1), span: Span(286512055620805) }
[ 5] Iter { iterable: Stack, range: PointerRange(6, 38), span: Span(284858089411503), content: false }
>>>[ 6] Next { span: Span(283730385177888), pattern: Single(MutAccess { head: 0, segments: [], span: Span(283730385177888) }) }
[ 7] Eq { lhs: Slot(0), rhs: Int(1), span: Span(290069022724325) }
[ 8] Conditional { condition: Stack, cond_span: Span(290069022724325), end_if: Pointer(15) }
[ 9] Scoped { range: PointerRange(10, 14), span: Span(290200588218245), content: false, looping: false }
>>>[10] AllocateArray { size_hint: 2 }
[11] ArrayPush { value: Const(0), span: Span(290234654283635) }
[12] ArrayPush { value: Const(1), span: Span(290241702435095) }
[13] Join { value: Stack, span: Span(290229955515995) }
<<<[14] Jump { end: Pointer(25), span: Span(290017336280285) }
[15] Eq { lhs: Slot(0), rhs: Int(2), span: Span(290527152569225) }
[16] Conditional { condition: Stack, cond_span: Span(290527152569225), end_if: Pointer(23) }
[17] Scoped { range: PointerRange(18, 22), span: Span(290592935316185), content: false, looping: false }
>>>[18] AllocateArray { size_hint: 2 }
[19] ArrayPush { value: Const(2), span: Span(290609968348880) }
[20] ArrayPush { value: Const(3), span: Span(290613492424610) }
[21] Join { value: Stack, span: Span(290607618965060) }
<<<[22] Jump { end: Pointer(25), span: Span(290501309347205) }
[23] Scoped { range: PointerRange(24, 25), span: Span(290743295880665), content: false, looping: false }
>>>[24] ContinueOp
<<<[25] FlowOp
[26] Destructure { value: Stack, pattern: Items([Single(MutAccess { head: 1, segments: [], span: Span(289679025010205) }), Single(MutAccess { head: 2, segments: [], span: Span(289735410221885) })], Span(289641434869085)), span: Span(289538061981005) }
[27] Scoped { range: PointerRange(28, 36), span: Span(295101402866839), content: true, looping: false }
>>>[28] Join { value: Slot(1), span: Span(295204775754919) }
[29] Join { value: Const(4), span: Span(295223570825479) }
[30] Join { value: Const(5), span: Span(295242365896039) }
[31] Join { value: Const(4), span: Span(295261160966599) }
[32] Join { value: Slot(2), span: Span(295298751107719) }
[33] Join { value: Const(4), span: Span(295317546178279) }
[34] Join { value: Const(6), span: Span(295336341248839) }
[35] Join { value: Const(4), span: Span(295355136319399) }
<<<[36] Join { value: Stack, span: Span(295101402866839) }
[37] Jump { end: Pointer(6), span: Span(282828221790996) }
<<<[38] Join { value: Stack, span: Span(282828221790996) }
I think that VM3D might actually stand a chance of being merged
Like it's the best VM so far
Don't know about performance ||yet||, but in terms of quality for sure
I need a cool bytecode printer here
Ok but like, that's kind of awesome ๐
And once we get to episode 995 it'll be just VM
It's not bytecode anymore, I am gonna be exploring multiple repr:
Arc<dyn Instruction>&dyn Instructiondyn Fn(...)- a giant enum (I'll have to write it ๐ญ)
I am curious which performs better
Also, currently constants are indexed in a common array (for the currently compiled code), but I am thinking of just doing deduplication of constants and not storing IDs
why not enum Inst?
Also also, currently the VM uses a Readable enum for the different types of data (see below) and I might make it have a Const(Value) type.
#[derive(Debug, Clone, Copy, PartialEq, Hash)]
pub enum Readable {
None,
Stack,
Auto,
Slot(usize), // i.e read a local var
Take(usize), // i.e take ownership of a local var
Const(usize),
Bool(bool),
Int(i64),
Float(Scalar),
Std(usize),
Math(usize),
}
that's what I mean with the "giant enum"
oh I ignored it
for dev, the Arc<dyn Instruction> is by far the easiest
But it's probably garbage due to the two levels of indirection
also adds complexity in other places
Will probably make a macro for declaring the instructions again ๐คทโโ๏ธ
Two levels?
It's a Vec<Arc<dyn ...>>
so one level from the vec, then one level to the arg,
then yeah actually
the dynamic dispatch too lol

I'll be taking notes when the PR goes live
just have these magic compiler.flow() calls everywhere which I H-A-T-E but Typst has very peculiar control flow semantics which I can't find a better way of doing in compiled code ๐คทโโ๏ธ
(it adds a special FlowOp if the previous instruction does not cause control flow)
and many instructions can have control flow (which wasn't the case with the old VM where there was only FlowOp)
so it might be a bit faster
using arc dyn this doesn't look like to introduce a huge penalty on common PC, because cpus optimize and precisely predict such indirections. iot devices may hate it but this is not a first goal. but whether to use depends on whether it helps to build a full vm3d quickly.
That's why I am planning on testing all the reprs, to measure it on both my MPB and my Desktop PC
could also try a mixture of both, having hot paths without indirection as enum members with an "unlikely" member that uses dyn Trait
which one is which is best found out on real documents
Set & Show rules are in!
that's actually a very good idea! Especially with "bigger" instructions
(i.e ones where the struct is big)
1831 passed, 801 failed, 0 skipped
I'll add all of the math stuff now before it keeps bugging me lol
Do that! We merge at MIDNIGHT!
1836 passed, 796 failed, 0 skipped
I am very dissapointed in how few test adding math fixed
That means that most of math is broken lol
1900 passed, 732 failed, 0 skipped
Fixed one bug, just 732 more to go?
Should probably add context next
1983 passed, 649 failed, 0 skipped
Indeed, it helped
Oh boy. Midnight might have been too ambitious.
yes ๐ฆ
I mean, if every 10 minutes I make 50 more tests pass, then I just need 1h30 to finish
lol

@sturdy sequoia I'm curious, approximately how much of execution time is spent comparing hashes?
Not talking the vm specifically
Iirc a 128 bit hash is used. I'm wondering if there would be a minor performance gain in only comparing the first 64 bits unless a collision is detected
There would likely already be a gain from just hashing to 64-bit instead of 128-bit
You would still need to either hash to 128 bit or two different 64 bit hashes
I'm specifically talking about the comparison aspect
You can treat a 128 bit hash as two 64 bit ones though. You would almost never have to check the other half of it
Man my phone's auto completion is driving me nuts
Does that make sense?
Yes but that would be three instructions vs just one
so I doubt it would actually be faster
and since it's 128-bit, there should be no clock division on x86 (don't know on arm)
(and even the clock division would only be on Intel CPUs)
And actually that's only valid for AVX512 so it doesn't matter
I bet direct comp at 128-bit is faster
Simd instructions are slower no?
Why 3?
that's basically what the compiler already does when comparing 128 integers
and well, it still checks the other half
but not doing this would be slower because conditionals are slower than just a simple xor and or
not really no
not integer 128-bit, they are pretty darn optimized
but best way would be direct compare
๐ญ
128 bit integers*
I'm keeping this error
Nah, just VLIW
that would result in too high of a collision probability
I know but I meant thatโs where the fais would be
Gains *
@left night In the new VM, I would like to dissalow the following:
Are you okay with that?
I can support it but it gets more complicated in the compiler & slower
?r
#import "@preview/codly:1.3.0": codly-init
#{codly-init = 5}
#codly-init
Because imo this is very cursed
and imports & includes are already a major pain point of the VM
(note that I don't want to prevent shadowing, just overriding)
so basically you want imports to become const?
Yes
if you could i wish breaks continues and returns won't be able on any positions of expressions.
I can't give a clear yes now, but I could imagine it
I agree with you, but I am trying to break as little as possible, the only breaking change (so far) would be:
- glob imports only work on "known" imports (i.e an import of just a string
#import "hello.typ": *or an import based on an global variable#import calc: *but not dynamic imports like:#import "a" + "/b.typ": *), all other imports would need to be named - imported values would be constants
(the first is something that was discussed back in the VM 1.0 days already)
an import based on an global variable
and the global variable needs to be const as well?
is that the reason why imports need to be const?
it can be another imported value
so that import std.calc; import calc: * works?
yep basically
otherwise I need to keep track whether something has been modified
which increases complexity a ton
calc.push(..)
is this intended behavior? ๐ญ
I mean, your approach in making it const fixes it
"fixes" it, I don't know if it needs fixing
I just find the idea of modifying imported variable potentially bad, but mostly it makes my life harder lol
What is a calc.push? Or it simply means a mutable call to some global array.
I was scared for a second that this would work lol
It was just meant to illustrate that it's hard to tell whether a variables is potentially mutated
This would no longer be allowed essentially
No, in this case, imports would be constant
We could introduce consts too, but I am specifically talking about imports being enforced const for now
(again, trying to minimize breaking stuff to make merging easier)
not a fan of const let personally
I would prefer then const x = ... instead of const let but not sure it's needed for typst
I wish we didn't have mutable methods
then it would be trivial to analyze statically whether a variable will be mutated
same, the whole MutAccess thing is so painful
(it's the name in the VM code)
though if there is no mutable method name called on a variable and no assignment, we still know for sure it won't be mutated statically, right?
that could be useful for optimizations
I am more a fan of val/var, so that all variables are on the "4-indent columns", but I didn't shopping the syntax but would like to encourage people use const let instead of mutable let.
yes
currently I am not using this property
but I don't know how to make use of that property (yet)
I guess you will love constant function references, like:
#let f(x) = x
#(f = x => x + 1) // in current typst, function definitions are unfortunately mutable
Currently they would still be mutable
I agree with you overall, but again the fewer breaking changes the VM introduces the more likely we are that Laurenz will merge ๐ซฃ
No, I try to explain what laurmaedje meant. You could check that the references are constant and optimize accordingly.
though if there is no mutable method name called on a variable and no assignment, we still know for sure it won't be mutated statically, right? that could be useful for optimizations
Yes but what optimizations could we do ?
๐ฑ For example, the ultimate inline call optmizations. Not only reduce overhead of comemo, but also elmininate cost to boxing and unboxing arguments.
You can also prepare static references of these variables when compiling to avoid looking up scopes repeatedly.
Most values are already boxed, but this could help reduce copies for arrays and maps with only one reference
I believe most values are not boxed. enum Value lightly wraps integers, floats and many things, and only some typst types are wrapped in a Value::Dyn(Arc<dyn ..>)
Ah right this is not a register VM
I forgot he's implementing a stack vm I was thinking if he had registers for passing arguments then it would be fine mostly
I do plan on redoing the argument system later on by instead passing the last n stack entries
But for args it mostly uses move semantics only copying local vars
The problem is that named variables always make things difficult, currently I have a MakeArgs instruction that collects the argument into one stack entry of type Value::Args but I am thinking of revamping it by "integrating" it into the FuncCall instruction instead, should make my life a bit easier ๐
And it removes an otherwise useless instruction
Ok, I removed the MarkArgs because I thought it was redundant, and removing it should improve performance (but I have not measured obviously)
Current state:
2003 passed, 629 failed, 0 skipped
Ok, last time I did not find a way of doing parallel module compilation
but this time I did!
I think that might unlock the performance to compensate for the initial compilation!
Ok, I am really proud that it actually works, it's a nightmare with lifetimes, but essentially by carrying a rayon::Scope along the entire chain, I can sort of bind the lifetime of the Engine to the lifetime of the scope and pass it around nicely with no lifetime issue, no unsafe nothing, and then I just need to collect the loaded module at the end of compilation (or when they're needed) and collect the sinks at the same time!
Brilliant I love it!
WE MERGE AT MIDNIGHT! ๐คฃ
what the
ASK LAURENZ IT'S HIS CODE!
I just intended it some more
Low key imports are responsible for like 1.5k LOC
My goals of a small VM were crushed by f-ing imports lol
I have line wrap enabled :-p
it's because I often write text in VSCode, so having line wrap helps make it more manageable
Comical.
Agreed lol
I just can't be bothered to make it clean for now
this is the code required to turn deferred constants (i.e constants from imported values) into actual "resolved" constants when compilation ends
Static (parallel) imports work
๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐
@left night I think that just for this, the VM might be worth it ๐
Added the compile timings
Note that this isn't a complete trace as it errors out, but it shows that it's using one thread per file! ๐
very cool
you might be cooking ๐
i do wonder how much % of a performance boost this might bring in practice, but i suppose this can be significant for eval-heavy crates
Well the main gain here is that we know eval-ing bytecode is faster (as per VM1 & VM2) but we know that compiling to bytecode was a significant enough slowdown to basically kill any performance boost from eval in a real world doc
so my hope is that by parallelising the compilation & the evaluation of dependencies of the "main" file, we might run into a scenario where the compilation is short enough to have a significant overall gain
the neat thing is: it's very easy to disable parallelisation if I want to to test its realy impact (and I don't mean -j 1, I mean in the compiler itself)
It gives a lot of room for exploration of:
- Instruction repr (enum vs dyn Trait vs mix of both)
- Parallelization of compile & eval
- Instruction architecture (large instruction vs Bytecode by comparing VM3D to VM2)
Stuff like that
it's a more "interesting" VM than the previous two
But imports are still an ungodly mess unfortunately ๐
Simply because there are two import paths: static (i.e I can know when compiling) and dynamic (i.e I can't know when compiling and resolve during eval)
Oh no, the VM is finding bugs in packages again ๐ญ (namely codly)
2238 passed, 394 failed, 0 skipped
Okay, that's enough for tonight
Getting there!
Just one final trace for tonight showing how it actually evals when it goes far enough!
And you can see how short the compile are compared to the eval for those that don't have huge dependencies (like the third green box from the top in the middle)
glob imports unfortunately greatly decrease this effictiveness since it requires waiting on the module to be evaluated (to be able to declare all of the imports)
but even in that case, we can benefit from parallelism
just glob imports = bad, good to know for codly, cetz, etc. if this thing ever gets merged
maybe you can go as far as possible before depending on a (potentially) imported variable?
that's exactly what I do, I actually "resolve" named imports once compilation is done
but for glob imports, I need to create the list of variables to declare them
and since I just have a * instead of a list of names
I have to busy wait for them to be declared ๐คทโโ๏ธ
not if you do it for every unknown variable
so:
#import "codly.typ": *=> slowest#import "codly.typ"=> fast#import "codly.typ": a, b, c=> fast#include "codly.typ"=> fastest
Yes but then it either "delays" the error at the end, or just slightly delays when I need to wait for the import to be done (i.e when I encounter an unknown variable)
so I don't think the gains are that huge and we still benefit from parallel compilation mostly
one place where we always 100% benefit from parallel compilation is #include "file.typ" since (if the file is constant), it always resolves to a deferred constant!
Which I think is amazing, that means that a project like my thesis that does one include per chapter gets parallel eval "for free"
So if we assume that my 10 chapters have the same eval time, that divides the total eval time by > 10x (since I assume compilation + eval is faster than just eval which was the case for VM1 and VM2)
I am kind of hoping that for larger projects that leads to a 4M decrease of eval time where 4 is the fixed speedup from using the VM and M is the number of chapter files
note that it's eval time not compile time
yeah thats one of the reasons why i was wondering about real %
Unfortunately that's basically impossible to measure :/
layout can often take much much more time, especially with state and stuff
Yes, but think about it this way: every time you re-eval part of the doc to update things (due to state and stuff), you get a significant speedup
because the compiled bytecode is shared
so larger documents with many introspection loop iterations should benefit most (and eval heavy docs in general too)
As well as in watch mode: any closure, etc. should re-use the previous result
(that is not implemented yet however)
yeah i think this can help, no doubt. Let's see how it goes
The parallelization factor will definitely be critical here
Woooo
A flat 15% improvement on the thesis with zero optimizations so far
(cold compile)
Memory usage is about 10% lower too!
About 10% incremental too so far
(and again that's on masterproef which is not an eval-heavy doc)
Time to try the raytracer
well it doesn't compile (yet)
Does it also skip 15% of the document? ๐ง
No the doc seems correct to my very trained eye
Alright, I'm not going to say, we merge at midnight, because you're too close to the finish line. But I will say, this is epic, and I'm glad we get to follow along your vm experiments.
I'm glad this time around, you're seeing some improvements.
We merge at midnight
I checked it again and it looks correct to me, I am sure there are some minute differences most likely, but none that are significant at least
Even the only eval-heavy part (which is a diagram) looks good
2274 passed, 358 failed, 0 skipped
Tests are slowly improving but a lot of the remaining ones are span mismatches because errors have the wrong span, just takes tedious fixing unfortunately
I also can't wait to test all of the different ideas I have to further improve performance, what I find reassuring is that compiling + eval is already quite a bit faster than old eval, but it can definitely be better
2281 passed, 351 failed, 0 skipped
Progress!
We merge at midnight
remember that you're only done when you can run elembic's tests ๐
(please do though cuz im very curious about the potential speedup there, it is very eval-heavy lol, though probs not as much as cetz)
As always, I'm super duper curious about memory usage reduction... Cuz better performance is great, but if the memory is reduced greatly.. that's even better, even if perf gains are small.
do you have plan to make or are you interested in making "customed element" PRs that integrate elembic into rust typst ๐?
i mean, there is definitely a plan to bring its most important functionality (custom elements and types) to the core, no question about that
but whether i will be the one implementing it, or when , or in which way, that's not yet defined
what i can say though is that ive been working on elembic again
i want to publish it to the package manager soon (within 1-2 weeks)
so that will be interesting to see
will be of great help to see how far we can go with the idea at least, and see what would we want to actually bring to the compiler
laurmaedje said it would give preasure to maintain a dashboard of plans, but I only would like to have some brief lists see how html export, customed elements, and other things on roadmap will go concretely. would be great even if we are updating the plan dashboards quarterly or even annually.
rust wg-async has some good list:
https://rust-lang.github.io/wg-async/vision/roadmap.html#overview
This extends too much probably. I only wondered when I saw elembic again: how many steps we should take to achieve custom elements, and how many steps we have achieved.
There are still a few design blockers that need to be resolved
I'll post more about that in https://discord.com/channels/1054443721975922748/1175895383600275516
2303 passed, 329 failed, 0 skipped
Fixing the issue I noted in #discussions (which was a VM bug) improved tests by 22!
It's about the same TBH
Just wanted to share some small performance results on masterproef:
-j 1gives 15% improved cold compile times-j 4gives 15% improved cold compile times-j 8gives 12% improved cold compile times
What I find strange is that I expected the speedup to increase with higher thread count due to parallel compilation
(there were run with hyperfine)
To be fair, the difference in initial eval time (before layout) is:
-j 1: 152 ms-j 8: 135 ms
Which is a surprisingly small improvement judging by the level of parallelism
It tells me something isn't going right with parallelism or the "joining" at the end of the compiler is very slow
could be both tbh
I mean the gains eval on a low-eval doc are 15%
so there's definitely hope for it
but I don't understand why parallel eval doesn't give a better return
Ok, I did some improvement to the joining logic which gives a 15% performance increase at -j 8 too!
To my surprise, using a big enum does not really improve performance ๐ฎ
Like after some layout optimization I get about 16% performance improvement vs 15% previously
so the gains are tiny
In most computers, the hardware will optimize virtual calls heavily and give similar overhead as making a big switch, i.e. both of them cost like a single branch instruction with knowing the branch target.
Yes, I managed to get it to be slightly better but really like 1% of the overall exec time
What is still missing optimizations wise:
- using
PicoStrfor arg & field names (super easy) - using
Cowfor values borrowed from the VM instead of systematically cloning (tedious but easy) - improving the repr of several instructions (i.e making them smol) and of the
CompiledSource(hard-ish) - adding a new type of constant (globals) which are just
&'static Value(easy) - adding the disabling of memoization for small function (especially now that eval is so cheap) (tedious)
- adding the
Readable::Movefor values that are at the end of their lifetime (easy most code is there)
My goals are: a further 30-50% reduction in compile time (i.e AST -> Vec<Instructions>) and a further 30% decrease in eval time
at least those are my goals
My hope is that a doc like masterproef can reduce by 25% total compilation time (i.e complete runtime) and an eval heavy doc could be 2-4x faster hopefully. Don't have a sense for how much time it should reduce in incremental so will test later
adding the disabling of memoization for small function
Is that the reason you started vm3d? I was not actually getting that we gave up and we are now making VM great again.
Well my thinking is that if we saw a slight performance decrease (bad) for huge memory decrease (good), then with the VM being heaps faster, we might be able to get the huge memory decrease with no performance decrease 
I also compared main with my fork on my Desktop compute (7950x3D): (time ยฑ stddev)
-j 1: 2% ยฑ 2% faster-j 4: 9% ยฑ 3% faster-j 8: 8% ยฑ 3% faster
So the gains are much higher on my M4 Max MBP
it's also interesting that typst is much faster on my MBP than my Desktop

I have a feeling it's a memory latency thing
The more I compare my MBP to my desktop the more I want to buy an M3 Ultra Mac Studio
Alright. Are you comparing linux to macos, or do you have stinky windows on your desktop? Cuz, I can tell you... Windows sucks.
Windows ๐ญ
Stop... Stop comparing, and go back to coding. Jesus Christ.... I have an airfrier in my kitchen that can beat any desktop with Windows on it..
I'm doing both 
And building is so slooooooooooooooooooooooooooooooooooooooooooooooow
but I think it's more of a Windows issue than a my PC is slow issue
You know what sucks? Linux drivers ๐
That's why we go Mac now. It's done. The war is over.
๐ โโ๏ธ
Ok, so interning of arg names (at -j 8):
- 7950X3D: 8% -> 13%
- MBP: 9% -> 12%
Note that the stddev is often ~3% so this is barely outside of the margin of error :/
At -j 1 on the MBP the gains are negligeable 15% -> 16% and on the desktop: almost none
But this is all on masterproef, gotta try on notes.typ, PgSuper's test doc, and raytracer.typ then do in-depth comparison
With enough simples to see the actual distributions
Might be a better data prefetching pipeline?
Yes but then the amd cpu should be better no?
Like typst is almost 50% on my MBP
And even more so with the VM
Did you force it to run on the 3d cache ccd?
I didn't do any pinning ๐
Too lazy, especially on Windows
@sturdy sequoia I have a problem on profiling tinymist using typst-timing. cli resolves spans after each compilation, but tinymist cannot do with the same way because it runs typst tasks concurrently. Speaking two worlds X and Y run at the same time and have different content (revision) of a same source file, where a span of the same raw number may point to two physical file location. The recorded span from X and Y of the source file will be mixed together and I cannot resolve them by the correct world. Any idea? Btw, I have applied typst-timing to the analyzers and it works really well. typst-ide could also do that I think.
https://github.com/andylokandy/smallbox is this something that could be useful for typst?
oh I see it was proposed once but was problematic because of unsafe use
We could add a thread ID to the spans, that should resolve it
right?
maybe a compile task ID, because typst's layout can be dispatched to rayon threads.
Right, that might make sense
We could add a thread_local! for a task ID in Typst
But right now we do use thread IDs actually
Since we can see multithreaded traces from Typst rayon dispatch
so I am not sure what is the source of this issue in tinymist
I run multiple worlds that map same FileId to different content, which make span resolving problematic.
Wow
Why?
But we could add a task_id in addition to the thread, but I don't think that would really solve the issue
I guess we could do one recording per "task"
For example, I compile thousands of docstrings in parallel and every docstring's main has file id main.typ
I actual don't know how to introduce a "task_id" for typst::compile.
My current solution is simply try resolving otherwise giving up. This is not good but does almost work.
@sturdy sequoia surely there's a pre-existing issue for https://github.com/typst/typst/issues/6508, but I can't find it. You would know I guess
I donโt think there is, it has been discussed ad nauseum on the Discord, but I donโt think an issue has ever been opened. I would be very curious if they have a good example document for this. Because I couldnโt show a huge advantage overall. An advantage? Yes, especially on memory usage but not much on performance
#1176509648707256370 message
There are some benchmarks there, but I didn't read it because LLM walls of text make me want to rip my eyes out
Is it an LLM wall of text?
Oh, I personally didn't get that impression (is it the en-dashes?)
-# seems like i need to re-tune my AI detector
Like 110%
Ok fair
But like all he shows in the benchmark seems to be amortisation of bigger function calls?
Is it something we didnโt already know?
I think the main complaint here is memory usage. Not sure exactly what they claimed speedup is
I think it's been edited in part after, so it's not the raw response, but if it's not LLM then OP has gone to the chatgpt school of writing โข๏ธ
Anyway I could swear there was an issue, but I guess it's just been discussed to death here on discord
Maybe this issue? Though in this case it only applies to plugins: https://github.com/typst/typst/issues/2309
I didn't get the impression that it is written by LLM tbh
Either way, they did quite a bit of measurements, which I can appreciate
I'm like McCarthy but for LLMs instead of communism
That's one raaaaaad issue.
As some of you have also experienced, it's quite tricky to benchmark the Typst compiler because repeated runs have high variance in wall clock times (sometimes between 5-10%). This makes it really tricky to make any kind of statements about performance improvements or regressions as very few changes have such drastic impact.
This was really quite frustrating. So I looked at what the Rust project does to benchmark rustc and tried to replicate it by
- renting a dedicated Hetzner server where we have 100% control over the OS
- tweaking it to reduce noise as much as possible (no ASLR, no hyper-threading, core isolation, dylib interposition to override
getrandomwith a deterministic version); most of these tricks directly follow what Rust does - built some infrastructure to perform benchmarks and compare measurements (I've had some fun by building the comparisons via Typst-powered server-side rendering)
It's all still quite rough, but the results are very promising. Below, there are two runs of the exact same binary. As you can see clock time varies a bit, but the instruction count is really almost exactly the same. (Here, the difference is literally just 1. Typically it's a bit more, but not more than ~0.0005% variance.)
Is this something that will eventually be part of CI in some way?
Likely in some capacity, but I don't think it would need to run on every PR
And if it should run on user PRs, I also need to make sure that it doesn't compromise the server
Does it include export in different target? ๐
Only PDF at the moment
@left night cooking something in comemo-prescient?
comemo-prescient is a testbed to check what the theoretical limit for optimizing comemo is. it runs the compiler twice, first recording the exact the hit/miss sequence and the second time, it already knows exactly how it's gonna play out and does not need to do any comemo tracking (which has significant overhead).
I'm in the process of fixing https://github.com/typst/typst/issues/5220
what do you mean by hit/miss sequence? As in what is actually reused versus not?
which calls are a cache hit vs not. in reality, comemo has to do a bunch of stuff to validate whether there is a cache entry that is a hit. in this branch it already knows whether it will be a hit or not as soon as it enters a memoized function call.
on main, the validation can take quadratic time (that's the issue). I've implemented something that can validate it in linear time with a tree data structure. this makes incremental somewhat faster on average and much faster in the cases where the issue comes into play, but it slightly regressed some clean compiles and I was trying to figure out why.
That'e exciting. I think incremental compilation speed is more important than cold/clean, but regressions are obviously not ideal anyway
interestingly it seems to be a slight regression in instruction count, but not in wall clock time
at least it appears that way. the thing is, wall clock time is noiser so it's a bit harder to say for sure
it could be that the instructions used in the one case are more expensive
but optimizing for instruction count is much nicer as it's super stable
it could probably
but optimizing for the benchmark server is good enough for me and the rest will behave reasonably similar I think
how much all of this caching stuff matters also highly depends on the document
if it's mostly text, it doesn't matter much; if it uses a ton of introspection, comemo overhead starts taking up a fair chunk of the compile time
As the package ecosystem grows and you have users blindly using a ton of them it's likely that even seemingly simple documents may have a lot of introspection
yeah
I've seen some...creative solutions
at least, with the changes I'm implementing, performance will not randomly degrade quadratically
if you have some contextual construct that appears a few hundred times in your document, that can already turn quadratic cache validation into a significant amount of the incremental compile time
Like inlining the entire package in one 1k+ LOC function? ๐
Can I have a bit of a rundown? Just curious ๐
the problem and solution are described in the https://discord.com/channels/1054443721975922748/1325795747165114479 forge if you read from the start. the forge isn't that long in total.
(Not sure if this is the right place) We have Rust code that uses Typst as a library and we recently implemented generating multiple pdf files in a loop from the same Typst code, with a different JSON file input for each iteration. We noticed that each iteration compiles significantly slower than the previous one. It turns out that adding comemo::evict(0); after each iteration fixes this, so it looks like Typst slows down once the cache grows beyond a certain threshold. Do you think this is related to what the quadratic cache validation posted above, or some other issue?
Quite likely. But luckily the quadratic cache issue will be gone soon and then you can test :)
Sounds good, thanks!
@left night Congrats on the performance improvements, on masterproef with -j 8 I am seeing around 4% improvement on my MBP, and in -j 1 I am not seeing improvements. This is on cold, but it's cool to see. I'll try watch mode later because I am going to the gym in five minutes ๐ช lol
I use masterproef in my benchmark setup actually and it's the document that performed worse with the comemo changes ^^
I assume the 4% are from faster hashing
In clean compiles, that is. In watch mode with -j 1, it was 15% faster.
I'm guessing the exact numbers will depend on hardware
I feel like masterproef is too text heavy to benefit from these things as much as other more recent documents. In a way masterproef was written when typst was a lot more basic than it is now and with those limitations in mind (remember the whole 27 GiB of ram to compile thing ๐คฃ)
So in a way I donโt find it surprising.
To be fair many of the features I contributed and other improvements to performance were tailored for masterproef so maybe it has just become a document for which typst is very heavily optimised.
@left night I did test watch mode and it does seem even 20% faster on masterproef! Really congratulations ๐ I can't wait to test it with the vm3d too, I need to finish updating it 
https://github.com/orxfun/orx-parallel is this something that could be useful for typst?
Claims to be much faster than rayon
Probably not, I don't think that rayons overhead currently contributes enough to warrant a switch
something that competes with rayon sounds very interesting in general
The benchmarks seem interesting, they also noted that they don't use container-specific implementations, which is why some container-specific rayon implementations are necessarily faster. So we'd only benefit from a switch in cases where rayon itself is the bottleneck, as I understand it. My understand is that, since layout can be fairly expensive, the rayon overhead should be negligible.
Maybe some things that aren't currently parallelized because the overhead would be too big?
I don't know. Just thought I'd mention it here
I think trying it out is definitely a good idea
Trying it out is definitely worth it imo
But I don't think this'll make a huge difference
perhaps some of the Deferred<T> will be a bit faster with less overhead
but that's about it
The performance guru has spoken
Well it's always worth it ๐
@silent I think that crown goes to @left night now with the comemo changes
nooooooo
I heck-ed it up, sorry Laurenz meant to silent ping
You woke the sleeping dragon
Depending on if Typst hasher inputs tend to be big or small, https://github.com/hoxxep/rapidhash might improve performance a bit
Could perhaps be worth a shot
(Chromium and a few other well known projects use the C++ version of the same hash function)
Is this beyond what was done with #6678 last week? That changed the hash function to rustc-hash, a polynomial hash with inspiration from wyhash.
Seems like it'd be a marginal improvement within typst for bringing in a pretty new dependency with few users so far https://github.com/hoxxep/rapidhash?tab=readme-ov-file#benchmarks
That one's fxhash. There are benchmarks in the rapidhash repo comparing the two
Maybe I'm misreading things, but that seems to be 64 bit only?
According to the readme it's platform-independent
I mean the hash is 64 bit
Doesn't typst use 128?
it might be rustc-hash that should be the item to look at (for current typst git hasher) in the benchmarks from that github repo I guess
Not everywhere: https://github.com/typst/typst/pull/6678
And with the comemo thing mentioned in the PR, I'm not sure it uses 128 bit hashes anywhere now
The performance improvements are mainly interesting if typst likes to hash large chunks of data
(Example: images)
I guess laurmaedje is the only person that really has the tools to analyze the real-world performance impact of this currently and would have to be the one to make the choice ๐
looks interesting certainly!
I thought 128 bit was crucial to avoid collisions?
Based on what I can tell (not a rust dev), comemo is still using 128 bit hashes
the Rust stdlib interface for hashing only works with u64 final values anyway, so that's also what a Rust HashMap uses, a 64-bit hash. The internal state of the hash algo could be bigger of course
siphash's internal state is 128 bits (Rust's default hasher)
oh 128 bit is the key size the internal state is bigger
It's hashes in two different places, that explains the collision of concepts
pub struct CacheData<C, Out> {
/// Maps from hashes to memoized results.
entries: FxHashMap<u128, Vec<CacheEntry<C, Out>>>,
}
The u128 value is a hash. The hash will be (64-bit) hashed to find its slot as a key in the hash map. That's the two layers.
I see, that's good to know and does indicate that rapidhash isn't the best choice there
I don't know the details but if the u128 value is already a good hash value, wouldn't the trivial hash function (truncate 128-bit to 64-bits) be faster and so on?
that's for hash map performance only
siphasher is for resistant hashes, rustc-hash for fast collision-is-ok hashes
yeah probably
Whether or not the truncated value would be a good 64 bit hash would depend on the algorithm used for the 128 bit hash no?
I can't see that being automatic
but regardless the rustc-hash is also used in a few of other places in comemo
Though one interesting thing to note is that rapidhash passes all of SMHasher3, while fxhash doesn't
then "good hash" in the conditional has to ensure all that ๐
we think that siphash has very nice properties, it should be able to be truncated as you want
saw it as well lol
I wonder why not use a passthrough hasher then, since the keys are already high quality hashes?
see here: #1176509648707256370 message
#1176509648707256370 message
the answer is: nobody did it
needs some testing. nohash-hasher sometimes makes accesses to hashmap slower for bad distribution but worth a try.
I did it now, just need to figure out how to benchmark it https://github.com/typst/comemo/pull/18
in this case the u128 is a siphash output so it should be at least as good as the hash was before (the change to rustc-hash)
if you give me a git commit hash of a [patch]ed typst version and the git commit hash of the baseline typst version, I can start a run
working on it
Then I suggest we test (new typst) 459a8654437902f68e2f753a171199c2b3482359 vs (main/previous commit typst) 6177c1b22df6d675646102247eb4382a3f1db60d
Maybe I should make a PR vs the typst repo so that the commit appears there and not just in my fork https://github.com/bluss/typst ?
no, it's fine
ok, benchmarks are queued, will take ~10-20mins

is that benchmarking tool typst agnostic? could imagine that there are other uses cases where this could be useful ๐
say for example to benchmark a PDF renderer ๐
currently, it is fairly Typst specific
but it could be extended I guess
the denoised setup is not Typst specific, just the infra around it
unfortunately, it seems to barely make a difference (for Typst), just ~0.05% fewer instructions on average
oh ok, that was quick response. Did we do something wrong? It also replaced all the FxHashMaps except for this one: edges: FxHashMap<(InnerId, u128), NodeId> because it's the one with more than just u128 as a key
the diff looks correct to me
probably, it's just not a bottleneck anymore, at least for Typst
that raw gains in comemo could be different, though they're probably also not that large because comemo overhead is a considerable part of Typst compilation in some of the benchmark documents
aha, now we know... you don't have any microbenchmarks in the comemo repo either? Then you can't be misled by them ๐ง
no, there are no comemo-specific benchmarks
but it's also not used by anybody notable except Typst afaik
if fxhash was a 2-5% win; i'd have a priori expected a similar or maybe 2% win with this too. I spot missing inline directives and the only thing that might have ruined the first try.
Is there a benchmark for typst or comemo I can run locally to verify a difference?
For me, locally was always too noisy
Isn't #[inline] within a single crate not really needed anymore
But it's no problem, I can start another run if you'd like
I'm not sure, because for example CallSequence is generic, isn't it monomorphized into typst somewhere, and then it still becomes cross crate(?)
If you don't mind, the new typst commit is 8a7853cf85590c6aabb26d53d4789e9080376021 vs (main/previous commit typst) 6177c1b22df6d675646102247eb4382a3f1db60d (same as before)
just running it seems to be the easiest way to know
matklad sums it up https://matklad.github.io/2021/07/09/inline-in-rust.html "Thereโs a lot of tribal knowledge surrounding #[inline] attribute in Rust." ๐
There's a lot of tribal knowledge surrounding #[inline] attribute in Rust.
I often find myself teaching how it works, so I finally decided to write this down.
Well I mean, even then, using a passthrough hasher surely can't be a bad thing, right? It's not like it matters in the grand scheme of things?
(code/binary size wise I mean)
And I would expect the gains, small as they may be, to increase in watch mode
And I don't think the current benchmarking harness Laurenz has made does watch mode?
I wonder if a small one could be declared to avoid dependencies: one that just hash a u128 and turns it into XOR-ing the u64 pieces? ๐คทโโ๏ธ
the WIP one I did is just a file, it's like 15 lines of code and it truncates the u128 to 64-bits. So basically it's done like that already
would be more satisfying to be able to solve this one too FxHashMap<(InnerId, u128), NodeId> but it can't be a passthrough anyway so maybe it's not worth more custom code
(InnerId is a usize)
the #[inline] annotation shows ~0.000% difference
but thin-lto is also enabled on release builds
ok ok. thanks a lot for running it
thin lto seems to be enough to enable inlining anyway then (with codegen-units=1)
fwiw, I am not per se against this change even if the gains are little
no, because I'm too lazy :(
also, the benchmarking harness uses only a small amount of documents
I'd kinda need a better collection
I was considering to ask people for them, but at the very least, it'd need some clarity on licensing
as always would be good to understand what's going on. When I'm looking in typst/typst (git) there are zero matches for "benchmark". It would be nice if there was a hint where to look for existing benchmarks, or which testcase to use
Could you use random documents from users of typst.app?
Could you share it with me the repo such that I could try and see if watch could be added? ๐
That would be against our Terms of Service
And would be wildly problematic imo
it's kinda held together by duct tape and doesn't really have a proper deployment stuff, it's literally a deploy.sh that needs your local shell to be able to ssh into root on the dedicated server
I don't know... As long as the contents aren't published anywhere, I don't expect ful privacy given that documents are encrypted I think
what's the current pointer towards benchmarking if we want to do some experiments, find our own documents to test?
oh wow
I would rather see an opt-in system for this then. I definitely have documents I would be willing to share for the purpose of benchmarking
but definitely not opt-out
we can access document contents (both to support people and also because building a collaborative encrypted online document editor is actually super hard; I know of basically only one, by Proton), but do so only to support people and enforce the ToS
if it were encrypted to the point that we couldn't access it, we couldn't run benchmarks either, so I'm not sure I understand the idea
but anyway, I don't think it's necessary
just a few representative documents should be more than enough
Necessary? No, I don't think so either
I just want a public repo with a few test cases and with proper licenses on the documents
I know y'all use at least masterproef in it hรฉhรฉ, still the document for which Typst is the most optimized ๐
I need to re-add one for masterproef, it used to have one then I accidentally yeeted it
masterproef repo doesn't have a license either, so I could not commit it and am probably already in violation for using it :(
It used to have one in the repo
I need to upload one
or re-upload one
As far as I know the content is under some Creative Commons license by the uni
there isn't really anything. I've only just set up the dedicated machine recently myself.
and unfortunately I don't really have the time currently to build it into a proper system usable by anyone
I was thinking it doesn't need to be as good as your main benchmarking setup. But if the repo of documents doesn't even exist publically, then that's how it is now
a benchmark corpus should preferably be stable so that you can compare versions. Then you need to port a little bit if there are breaking changes
Ok @left night crazy idea: spin up a QEMU emulated VM, and count instructions of just typst, that way it's a "pure" instruction count? Doesn't take into account the effect of memory layout - that we know are important, see Content Rework 2 - but at least it should give a rough idea. ๐ (it's horrible I know)
I regularly update my thesis because it gets used for bencharmking, at least by myself
I have these basic cases: A cetz diagram, hello world, and a ton of lorem ipsum.
Then masterproef
And then one math-heavy document that I cannot share
That's it currently
I mean, that would maybe also work.
I have some slides I could share, I'd need to ask for permission but they have been presented outside of the company quite a few times before at events/conferences so it should be fine. (I'll ask on Thursday). They're nice and slow because of introspection 
Speaking of. Did you see the rayon alternative mentioned earlier?
How about some of the more beefy package manuals?
I didn't really look at it, but I have my doubts whether rayon is at all a bottleneck.
The difference in performance seemed very substantial, but I guess it's very situational based on what the limiting factor is
How powerful is the benchmarking server?
The codly one is rather chonky ๐
Low key I thought you guys would've bought a base Mac Mini M1/2/3/4 and shoved it in a closet in your office/at home lol
I'm not sure how representative macs are for performance in general
Is CC-BY-SA alright, or would it need to be CC-BY?
Surprisingly, compared to my 7950x3d, any change to instructions count leads to the same speedup (say 10% for both) but memory speedups are bigger on the mac (like cache efficiency stuff)
both would work
I just want to be able to compile it and commit it to a repo
I think it could only be applied to a very small number of structs in Typst
In guessing this could eventually substantially speed up the web app https://github.com/WebAssembly/wide-arithmetic/blob/main/proposals%2Fwide-arithmetic%2FOverview.md
Currently it's not worth the effort but at some point we could have separate wasm blobs for newer and older browsers.
I don't think a single browser supports it yet anyway
@left night do you know if your recent changes will speed up https://typst.app/universe/package/equate/ ?
No idea
Oh, speeding up the equate package would be so great ๐
How much time is spent on shaping in a typical document? Asking since harfrust is supposedly like 3 times faster than rustybuzz
Can you read minds?
Because I have literally today gotten an email from Behdad where he was inquiring about how much shaping is a bottleneck in Typst.
Copying a part of my response here:
- On a long document that consists only of Lorem ipsum, just the shaping with rustybuzz (excluding shape plan creation) made up ~25% of total compile time (including PDF export which is about 1/3 of compilation there). For previews, PDF export is skipped, so there it would be a little more. This was with naive first-fit line breaking. With optimized line breaks, it would become less, at least as long as the line boundaries are safe to break.
- On a document that uses more of Typst's other features, shaping quickly becomes a much smaller factor. On one thesis document I tried, it was only 3% of compile time (still with rustybuzz).
Maybe
That should still be a decent performance boost for text heavy documents. Maybe also memory usage if that's improved?
But not game changing
Is masterproef also in the 3% neighbourhood? I really can't recall, but I think it's a fair bit heavier isn't it?
masterproef was the 3% doc ๐
Fair lol
Harfrust is now only 16% slower than Harfbuzz based on the geometric mean of the latest update in their spreadsheet
Compared to rustybuzz which is like 3-4 times slower
Apparently less even https://github.com/linebender/resvg/pull/922#issuecomment-3275567195
Was there a behchmarking script somewhere or am I misremembering?
IIRC the benchmarking stuff has more to do with steps towards consistent results than specialty stuff around Typst itself: #1176509648707256370 message. I don't think anyone outside of the team currently has that set up, but in the past they've helped get various patches tested #1176509648707256370 message
for less critical stuff / initial testing you can probably follow standard practices.
The first link is probably what I had in mind. Well at least now I know what it was about, thanks
I mean there is also this #1176509648707256370 message (right below you can find some documents that are used for it), which I guess fits "benchmarking script" better, but its just a hacky interface to the former ๐
There are a couple of things users have tried in this thread but I dont think those have been really validated yet.
Wishing u the best with your shenanigans
Just rechecked our use case (#1176509648707256370 message) with Typst 0.14 and Comemo 0.5, unfortunately the issue still exists so we still have to evict comemo cache after each iteration. There is noticable increase in memory usage in both debug and release builds, and in debug builds the pdf generation still gets noticably slower after a few dozen iterations.
Could you share a minimal reproduction?
Yes I can create one (but probably not very soon)