#Performance

1 messages ยท Page 9 of 1

sturdy sequoia
#

haha yes

#

Gotta start somewhere, the test suite is too large to effectively debug with initially

#

as I go on I'll use that but first I start simple

#

(remember that initially zero test pass, then 1k, then 2k, etc. it's not test-by-test thankfully)

placid pivot
#

Good luck krkrkr

sturdy sequoia
#

1008 passed, 1624 failed, 0 skipped
It's something

sly pecan
#

Somehow I'm surprised typst has only 2632 tests

cunning wadi
#

But also there are not enough tests

#

(There never are)

sturdy sequoia
proud sandal
#

No tests means you can silently break stuff and no one will know ๐Ÿ˜ˆ

untold turret
#

i didn't know I added 500+ tests for the language server. Adding tests is a heavy work tbh. In that meaning, 3000+ tests is huge. But the giant repo stories like rust and llvm do have much more insane number of tests.

feral imp
#

I know you're tired of working on the VM, but the VM Chronicles that you've been posting here, have really been better than Netflix. I'll miss it when vm3d is merged. Hopefully laurmadje reviews it for a good few episodes.

sly pecan
#

There'll be a 2 Virtual 4 Machine after

#

Or Virtual M4chine

#

Then Virtual Machine with the V stylized as a roman numeral

#

then VIrtual machine

sturdy sequoia
#

As I am debugging I keep printing instructions to debug them, so I finally made them "pretty":

[ 0] AllocateArray { size_hint: 4 }
[ 1] ArrayPush { value: Int(1), span: Span(285158810540467) }
[ 2] ArrayPush { value: Int(2), span: Span(285609892233913) }
[ 3] ArrayPush { value: Int(3), span: Span(286060973927359) }
[ 4] ArrayPush { value: Int(1), span: Span(286512055620805) }
[ 5] Iter { iterable: Stack, range: PointerRange(6, 38), span: Span(284858089411503), content: false }
>>>[ 6] Next { span: Span(283730385177888), pattern: Single(MutAccess { head: 0, segments: [], span: Span(283730385177888) }) }
   [ 7] Eq { lhs: Slot(0), rhs: Int(1), span: Span(290069022724325) }
   [ 8] Conditional { condition: Stack, cond_span: Span(290069022724325), end_if: Pointer(15) }
   [ 9] Scoped { range: PointerRange(10, 14), span: Span(290200588218245), content: false, looping: false }
   >>>[10] AllocateArray { size_hint: 2 }
      [11] ArrayPush { value: Const(0), span: Span(290234654283635) }
      [12] ArrayPush { value: Const(1), span: Span(290241702435095) }
      [13] Join { value: Stack, span: Span(290229955515995) }
<<<[14] Jump { end: Pointer(25), span: Span(290017336280285) }
   [15] Eq { lhs: Slot(0), rhs: Int(2), span: Span(290527152569225) }
   [16] Conditional { condition: Stack, cond_span: Span(290527152569225), end_if: Pointer(23) }
   [17] Scoped { range: PointerRange(18, 22), span: Span(290592935316185), content: false, looping: false }
   >>>[18] AllocateArray { size_hint: 2 }
      [19] ArrayPush { value: Const(2), span: Span(290609968348880) }
      [20] ArrayPush { value: Const(3), span: Span(290613492424610) }
      [21] Join { value: Stack, span: Span(290607618965060) }
<<<[22] Jump { end: Pointer(25), span: Span(290501309347205) }
   [23] Scoped { range: PointerRange(24, 25), span: Span(290743295880665), content: false, looping: false }
   >>>[24] ContinueOp
<<<[25] FlowOp
   [26] Destructure { value: Stack, pattern: Items([Single(MutAccess { head: 1, segments: [], span: Span(289679025010205) }), Single(MutAccess { head: 2, segments: [], span: Span(289735410221885) })], Span(289641434869085)), span: Span(289538061981005) }
   [27] Scoped { range: PointerRange(28, 36), span: Span(295101402866839), content: true, looping: false }
   >>>[28] Join { value: Slot(1), span: Span(295204775754919) }
      [29] Join { value: Const(4), span: Span(295223570825479) }
      [30] Join { value: Const(5), span: Span(295242365896039) }
      [31] Join { value: Const(4), span: Span(295261160966599) }
      [32] Join { value: Slot(2), span: Span(295298751107719) }
      [33] Join { value: Const(4), span: Span(295317546178279) }
      [34] Join { value: Const(6), span: Span(295336341248839) }
      [35] Join { value: Const(4), span: Span(295355136319399) }
<<<[36] Join { value: Stack, span: Span(295101402866839) }
   [37] Jump { end: Pointer(6), span: Span(282828221790996) }
<<<[38] Join { value: Stack, span: Span(282828221790996) }
sturdy sequoia
#

Like it's the best VM so far

#

Don't know about performance ||yet||, but in terms of quality for sure

untold turret
#

I need a cool bytecode printer here

sturdy sequoia
#

And once we get to episode 995 it'll be just VM

sturdy sequoia
#

I am curious which performs better

#

Also, currently constants are indexed in a common array (for the currently compiled code), but I am thinking of just doing deduplication of constants and not storing IDs

untold turret
#

why not enum Inst?

sturdy sequoia
#

Also also, currently the VM uses a Readable enum for the different types of data (see below) and I might make it have a Const(Value) type.

#[derive(Debug, Clone, Copy, PartialEq, Hash)]
pub enum Readable {
    None,
    Stack,
    Auto,
    Slot(usize), // i.e read a local var
    Take(usize), // i.e take ownership of a local var
    Const(usize),
    Bool(bool),
    Int(i64),
    Float(Scalar),
    Std(usize),
    Math(usize),
}
sturdy sequoia
untold turret
#

oh I ignored it

sturdy sequoia
#

for dev, the Arc<dyn Instruction> is by far the easiest

#

But it's probably garbage due to the two levels of indirection

#

also adds complexity in other places

#

Will probably make a macro for declaring the instructions again ๐Ÿคทโ€โ™‚๏ธ

sturdy sequoia
proud sandal
#

Oh because of the dynamic dispatch

#

Oh

sturdy sequoia
#

so one level from the vec, then one level to the arg,

#

then yeah actually

#

the dynamic dispatch too lol

proud sandal
sturdy sequoia
#

haha cache goes ๐Ÿ’ฉ

#

But the compiler of the new VM is heaps and heaps simpler

proud sandal
#

I'll be taking notes when the PR goes live

sturdy sequoia
#

just have these magic compiler.flow() calls everywhere which I H-A-T-E but Typst has very peculiar control flow semantics which I can't find a better way of doing in compiled code ๐Ÿคทโ€โ™‚๏ธ

#

(it adds a special FlowOp if the previous instruction does not cause control flow)

#

and many instructions can have control flow (which wasn't the case with the old VM where there was only FlowOp)

#

so it might be a bit faster

untold turret
sturdy sequoia
untold turret
#

this is actual the best plan.

#

๐Ÿ˜„

proud sandal
#

could also try a mixture of both, having hot paths without indirection as enum members with an "unlikely" member that uses dyn Trait

#

which one is which is best found out on real documents

sturdy sequoia
#

Set & Show rules are in!

sturdy sequoia
#

(i.e ones where the struct is big)

#

1831 passed, 801 failed, 0 skipped

#

I'll add all of the math stuff now before it keeps bugging me lol

feral imp
sturdy sequoia
#

1836 passed, 796 failed, 0 skipped
I am very dissapointed in how few test adding math fixed

#

That means that most of math is broken lol

#

1900 passed, 732 failed, 0 skipped
Fixed one bug, just 732 more to go? thonk2

#

Should probably add context next

#

1983 passed, 649 failed, 0 skipped

#

Indeed, it helped

feral imp
#

Oh boy. Midnight might have been too ambitious.

sturdy sequoia
sturdy sequoia
#

lol

sly pecan
#

๐Ÿง 

sturdy sequoia
sly pecan
#

@sturdy sequoia I'm curious, approximately how much of execution time is spent comparing hashes?

#

Not talking the vm specifically

#

Iirc a 128 bit hash is used. I'm wondering if there would be a minor performance gain in only comparing the first 64 bits unless a collision is detected

sturdy sequoia
sly pecan
#

I'm specifically talking about the comparison aspect

sturdy sequoia
#

then two u64 would likely be slower

#

because SIMD

sly pecan
#

You can treat a 128 bit hash as two 64 bit ones though. You would almost never have to check the other half of it

#

Man my phone's auto completion is driving me nuts

#

Does that make sense?

sturdy sequoia
#

so I doubt it would actually be faster

#

and since it's 128-bit, there should be no clock division on x86 (don't know on arm)

#

(and even the clock division would only be on Intel CPUs)

#

And actually that's only valid for AVX512 so it doesn't matter

#

I bet direct comp at 128-bit is faster

sly pecan
cunning wadi
#

and well, it still checks the other half

#

but not doing this would be slower because conditionals are slower than just a simple xor and or

sturdy sequoia
#

not integer 128-bit, they are pretty darn optimized

#

but best way would be direct compare

sly pecan
#

๐Ÿ˜ญ

cunning wadi
#

I'm keeping this error

sly pecan
left night
sturdy sequoia
#

Gains *

sturdy sequoia
#

@left night In the new VM, I would like to dissalow the following:

#

Are you okay with that?

#

I can support it but it gets more complicated in the compiler & slower

#

?r

#import "@preview/codly:1.3.0": codly-init

#{codly-init = 5}

#codly-init
delicate fossilBOT
sturdy sequoia
#

Because imo this is very cursed

#

and imports & includes are already a major pain point of the VM

#

(note that I don't want to prevent shadowing, just overriding)

left night
#

so basically you want imports to become const?

sturdy sequoia
#

Yes

untold turret
#

if you could i wish breaks continues and returns won't be able on any positions of expressions.

left night
sturdy sequoia
#

(the first is something that was discussed back in the VM 1.0 days already)

left night
#

an import based on an global variable
and the global variable needs to be const as well?

#

is that the reason why imports need to be const?

sturdy sequoia
left night
#

so that import std.calc; import calc: * works?

sturdy sequoia
#

otherwise I need to keep track whether something has been modified

#

which increases complexity a ton

left night
#

calc.push(..)

sturdy sequoia
left night
#

I mean, your approach in making it const fixes it

sturdy sequoia
#

I just find the idea of modifying imported variable potentially bad, but mostly it makes my life harder lol

untold turret
#

What is a calc.push? Or it simply means a mutable call to some global array.

sturdy sequoia
#

I was scared for a second that this would work lol

left night
sturdy sequoia
#

This would no longer be allowed essentially

untold turret
#

do you value to introduce a const?

#

const let

sturdy sequoia
#

No, in this case, imports would be constant

#

We could introduce consts too, but I am specifically talking about imports being enforced const for now

#

(again, trying to minimize breaking stuff to make merging easier)

left night
#

not a fan of const let personally

sturdy sequoia
#

I would prefer then const x = ... instead of const let but not sure it's needed for typst

left night
#

I wish we didn't have mutable methods

#

then it would be trivial to analyze statically whether a variable will be mutated

sturdy sequoia
#

(it's the name in the VM code)

left night
#

though if there is no mutable method name called on a variable and no assignment, we still know for sure it won't be mutated statically, right?

#

that could be useful for optimizations

untold turret
#

I am more a fan of val/var, so that all variables are on the "4-indent columns", but I didn't shopping the syntax but would like to encourage people use const let instead of mutable let.

sturdy sequoia
#

currently I am not using this property

#

but I don't know how to make use of that property (yet)

untold turret
sturdy sequoia
#

I agree with you overall, but again the fewer breaking changes the VM introduces the more likely we are that Laurenz will merge ๐Ÿซฃ

untold turret
sturdy sequoia
untold turret
#

๐Ÿฑ For example, the ultimate inline call optmizations. Not only reduce overhead of comemo, but also elmininate cost to boxing and unboxing arguments.

#

You can also prepare static references of these variables when compiling to avoid looking up scopes repeatedly.

proud sandal
#

Most values are already boxed, but this could help reduce copies for arrays and maps with only one reference

untold turret
proud sandal
#

Ah right this is not a register VM

#

I forgot he's implementing a stack vm I was thinking if he had registers for passing arguments then it would be fine mostly

sturdy sequoia
#

But for args it mostly uses move semantics only copying local vars

#

The problem is that named variables always make things difficult, currently I have a MakeArgs instruction that collects the argument into one stack entry of type Value::Args but I am thinking of revamping it by "integrating" it into the FuncCall instruction instead, should make my life a bit easier ๐Ÿ˜†

#

And it removes an otherwise useless instruction

#

Ok, I removed the MarkArgs because I thought it was redundant, and removing it should improve performance (but I have not measured obviously)

#

Current state:

2003 passed, 629 failed, 0 skipped

#

Ok, last time I did not find a way of doing parallel module compilation

#

but this time I did!

#

I think that might unlock the performance to compensate for the initial compilation!

sturdy sequoia
#

Ok, I am really proud that it actually works, it's a nightmare with lifetimes, but essentially by carrying a rayon::Scope along the entire chain, I can sort of bind the lifetime of the Engine to the lifetime of the scope and pass it around nicely with no lifetime issue, no unsafe nothing, and then I just need to collect the loaded module at the end of compilation (or when they're needed) and collect the sinks at the same time!

#

Brilliant I love it!

feral imp
#

WE MERGE AT MIDNIGHT! ๐Ÿคฃ

sturdy sequoia
#

Forgive me for I have sinned ๐Ÿ’€

#

I told you impors were tricky ๐Ÿ˜„

proud sandal
#

what the

sturdy sequoia
#

I just intended it some more

#

Low key imports are responsible for like 1.5k LOC

proud sandal
#

dedented some of it too

#

for good measure

sturdy sequoia
#

My goals of a small VM were crushed by f-ing imports lol

sturdy sequoia
#

it's because I often write text in VSCode, so having line wrap helps make it more manageable

feral imp
sturdy sequoia
#

I just can't be bothered to make it clean for now

#

this is the code required to turn deferred constants (i.e constants from imported values) into actual "resolved" constants when compilation ends

sturdy sequoia
#

Static (parallel) imports work

#

๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€ ๐Ÿš€

#

@left night I think that just for this, the VM might be worth it ๐Ÿ˜Ž

#

Added the compile timings

#

Note that this isn't a complete trace as it errors out, but it shows that it's using one thread per file! ๐Ÿš€

left night
#

very cool

glad urchin
#

i do wonder how much % of a performance boost this might bring in practice, but i suppose this can be significant for eval-heavy crates

sturdy sequoia
#

so my hope is that by parallelising the compilation & the evaluation of dependencies of the "main" file, we might run into a scenario where the compilation is short enough to have a significant overall gain

#

the neat thing is: it's very easy to disable parallelisation if I want to to test its realy impact (and I don't mean -j 1, I mean in the compiler itself)

#

It gives a lot of room for exploration of:

#
  • Instruction repr (enum vs dyn Trait vs mix of both)
  • Parallelization of compile & eval
  • Instruction architecture (large instruction vs Bytecode by comparing VM3D to VM2)
#

Stuff like that

#

it's a more "interesting" VM than the previous two

#

But imports are still an ungodly mess unfortunately ๐Ÿ˜

#

Simply because there are two import paths: static (i.e I can know when compiling) and dynamic (i.e I can't know when compiling and resolve during eval)

#

Oh no, the VM is finding bugs in packages again ๐Ÿ˜ญ (namely codly)

sturdy sequoia
#

2238 passed, 394 failed, 0 skipped

#

Okay, that's enough for tonight

#

Getting there!

#

Just one final trace for tonight showing how it actually evals when it goes far enough!

#

And you can see how short the compile are compared to the eval for those that don't have huge dependencies (like the third green box from the top in the middle)

#

glob imports unfortunately greatly decrease this effictiveness since it requires waiting on the module to be evaluated (to be able to declare all of the imports)

#

but even in that case, we can benefit from parallelism

#

just glob imports = bad, good to know for codly, cetz, etc. if this thing ever gets merged

glad urchin
sturdy sequoia
#

but for glob imports, I need to create the list of variables to declare them

#

and since I just have a * instead of a list of names

#

I have to busy wait for them to be declared ๐Ÿคทโ€โ™‚๏ธ

glad urchin
sturdy sequoia
#

so:

  • #import "codly.typ": * => slowest
  • #import "codly.typ" => fast
  • #import "codly.typ": a, b, c => fast
  • #include "codly.typ" => fastest
sturdy sequoia
#

so I don't think the gains are that huge and we still benefit from parallel compilation mostly

#

one place where we always 100% benefit from parallel compilation is #include "file.typ" since (if the file is constant), it always resolves to a deferred constant!

#

Which I think is amazing, that means that a project like my thesis that does one include per chapter gets parallel eval "for free"

#

So if we assume that my 10 chapters have the same eval time, that divides the total eval time by > 10x (since I assume compilation + eval is faster than just eval which was the case for VM1 and VM2)

#

I am kind of hoping that for larger projects that leads to a 4M decrease of eval time where 4 is the fixed speedup from using the VM and M is the number of chapter files

#

note that it's eval time not compile time

glad urchin
#

yeah thats one of the reasons why i was wondering about real %

sturdy sequoia
glad urchin
#

layout can often take much much more time, especially with state and stuff

sturdy sequoia
#

Yes, but think about it this way: every time you re-eval part of the doc to update things (due to state and stuff), you get a significant speedup

#

because the compiled bytecode is shared

#

so larger documents with many introspection loop iterations should benefit most (and eval heavy docs in general too)

#

As well as in watch mode: any closure, etc. should re-use the previous result

#

(that is not implemented yet however)

glad urchin
#

The parallelization factor will definitely be critical here

sturdy sequoia
#

The Thesisโ„ข๏ธ compiles!

#

There are still bugs

#

but IT COMPILES

proud sandal
#

Woooo

sturdy sequoia
#

A flat 15% improvement on the thesis with zero optimizations so far

#

(cold compile)

#

Memory usage is about 10% lower too!

#

About 10% incremental too so far

#

(and again that's on masterproef which is not an eval-heavy doc)

#

Time to try the raytracer

#

well it doesn't compile (yet)

sly pecan
sturdy sequoia
proud sandal
#

Dherse checking his features:

#

๐Ÿ‘๏ธ ๐Ÿ‘„ ๐Ÿ‘๏ธ

feral imp
#

Alright, I'm not going to say, we merge at midnight, because you're too close to the finish line. But I will say, this is epic, and I'm glad we get to follow along your vm experiments.

I'm glad this time around, you're seeing some improvements.

proud sandal
#

We merge at midnight

sturdy sequoia
#

Even the only eval-heavy part (which is a diagram) looks good

#

2274 passed, 358 failed, 0 skipped
Tests are slowly improving but a lot of the remaining ones are span mismatches because errors have the wrong span, just takes tedious fixing unfortunately

sturdy sequoia
#

I also can't wait to test all of the different ideas I have to further improve performance, what I find reassuring is that compiling + eval is already quite a bit faster than old eval, but it can definitely be better

sturdy sequoia
#

2281 passed, 351 failed, 0 skipped
Progress!

sly pecan
lofty turret
glad urchin
#

(please do though cuz im very curious about the potential speedup there, it is very eval-heavy lol, though probs not as much as cetz)

feral imp
#

As always, I'm super duper curious about memory usage reduction... Cuz better performance is great, but if the memory is reduced greatly.. that's even better, even if perf gains are small.

untold turret
glad urchin
#

but whether i will be the one implementing it, or when , or in which way, that's not yet defined

#

what i can say though is that ive been working on elembic again

#

i want to publish it to the package manager soon (within 1-2 weeks)

#

so that will be interesting to see

#

will be of great help to see how far we can go with the idea at least, and see what would we want to actually bring to the compiler

untold turret
# glad urchin i mean, there is definitely a plan to bring its most important functionality (cu...

laurmaedje said it would give preasure to maintain a dashboard of plans, but I only would like to have some brief lists see how html export, customed elements, and other things on roadmap will go concretely. would be great even if we are updating the plan dashboards quarterly or even annually.

rust wg-async has some good list:
https://rust-lang.github.io/wg-async/vision/roadmap.html#overview

untold turret
#

This extends too much probably. I only wondered when I saw elembic again: how many steps we should take to achieve custom elements, and how many steps we have achieved.

left night
sturdy sequoia
#

2303 passed, 329 failed, 0 skipped
Fixing the issue I noted in #discussions (which was a VM bug) improved tests by 22!

sturdy sequoia
#

Just wanted to share some small performance results on masterproef:

  • -j 1 gives 15% improved cold compile times
  • -j 4 gives 15% improved cold compile times
  • -j 8 gives 12% improved cold compile times
    What I find strange is that I expected the speedup to increase with higher thread count due to parallel compilation hyperthink
    (there were run with hyperfine)
#

To be fair, the difference in initial eval time (before layout) is:

  • -j 1: 152 ms
  • -j 8: 135 ms
    Which is a surprisingly small improvement judging by the level of parallelism thonk2
#

It tells me something isn't going right with parallelism or the "joining" at the end of the compiler is very slow

#

could be both tbh

proud sandal
#

Oh no it's going to be vm2.0 all over again

#

/jk

sturdy sequoia
#

so there's definitely hope for it

#

but I don't understand why parallel eval doesn't give a better return

sturdy sequoia
#

Ok, I did some improvement to the joining logic which gives a 15% performance increase at -j 8 too!

sturdy sequoia
#

To my surprise, using a big enum does not really improve performance ๐Ÿ˜ฎ

sturdy sequoia
#

Like after some layout optimization I get about 16% performance improvement vs 15% previously

#

so the gains are tiny

untold turret
#

In most computers, the hardware will optimize virtual calls heavily and give similar overhead as making a big switch, i.e. both of them cost like a single branch instruction with knowing the branch target.

sturdy sequoia
#

What is still missing optimizations wise:

  • using PicoStr for arg & field names (super easy)
  • using Cow for values borrowed from the VM instead of systematically cloning (tedious but easy)
  • improving the repr of several instructions (i.e making them smol) and of the CompiledSource (hard-ish)
  • adding a new type of constant (globals) which are just &'static Value (easy)
  • adding the disabling of memoization for small function (especially now that eval is so cheap) (tedious)
  • adding the Readable::Move for values that are at the end of their lifetime (easy most code is there)
#

My goals are: a further 30-50% reduction in compile time (i.e AST -> Vec<Instructions>) and a further 30% decrease in eval time

#

at least those are my goals

#

My hope is that a doc like masterproef can reduce by 25% total compilation time (i.e complete runtime) and an eval heavy doc could be 2-4x faster hopefully. Don't have a sense for how much time it should reduce in incremental so will test later

untold turret
#

adding the disabling of memoization for small function
Is that the reason you started vm3d? I was not actually getting that we gave up and we are now making VM great again.

sturdy sequoia
sturdy sequoia
#

I also compared main with my fork on my Desktop compute (7950x3D): (time ยฑ stddev)

  • -j 1: 2% ยฑ 2% faster
  • -j 4: 9% ยฑ 3% faster
  • -j 8: 8% ยฑ 3% faster
    So the gains are much higher on my M4 Max MBP
#

it's also interesting that typst is much faster on my MBP than my Desktop

#

I have a feeling it's a memory latency thing

sturdy sequoia
#

The more I compare my MBP to my desktop the more I want to buy an M3 Ultra Mac Studio

feral imp
#

Alright. Are you comparing linux to macos, or do you have stinky windows on your desktop? Cuz, I can tell you... Windows sucks.

feral imp
#

Stop... Stop comparing, and go back to coding. Jesus Christ.... I have an airfrier in my kitchen that can beat any desktop with Windows on it..

sturdy sequoia
#

And building is so slooooooooooooooooooooooooooooooooooooooooooooooow

#

but I think it's more of a Windows issue than a my PC is slow issue

sly pecan
feral imp
sly pecan
#

๐Ÿ™…โ€โ™‚๏ธ

sturdy sequoia
#

Ok, so interning of arg names (at -j 8):

  • 7950X3D: 8% -> 13%
  • MBP: 9% -> 12%
    Note that the stddev is often ~3% so this is barely outside of the margin of error :/
#

At -j 1 on the MBP the gains are negligeable 15% -> 16% and on the desktop: almost none

#

But this is all on masterproef, gotta try on notes.typ, PgSuper's test doc, and raytracer.typ then do in-depth comparison

#

With enough simples to see the actual distributions

lost meteor
sturdy sequoia
#

Like typst is almost 50% on my MBP

#

And even more so with the VM

sly pecan
sturdy sequoia
#

Too lazy, especially on Windows

untold turret
#

@sturdy sequoia I have a problem on profiling tinymist using typst-timing. cli resolves spans after each compilation, but tinymist cannot do with the same way because it runs typst tasks concurrently. Speaking two worlds X and Y run at the same time and have different content (revision) of a same source file, where a span of the same raw number may point to two physical file location. The recorded span from X and Y of the source file will be mixed together and I cannot resolve them by the correct world. Any idea? Btw, I have applied typst-timing to the analyzers and it works really well. typst-ide could also do that I think.

sly pecan
#

oh I see it was proposed once but was problematic because of unsafe use

sturdy sequoia
#

right?

untold turret
sturdy sequoia
#

We could add a thread_local! for a task ID in Typst

#

But right now we do use thread IDs actually

#

Since we can see multithreaded traces from Typst rayon dispatch

#

so I am not sure what is the source of this issue in tinymist

untold turret
sturdy sequoia
#

Why?

#

But we could add a task_id in addition to the thread, but I don't think that would really solve the issue

#

I guess we could do one recording per "task"

untold turret
#

For example, I compile thousands of docstrings in parallel and every docstring's main has file id main.typ

untold turret
#

My current solution is simply try resolving otherwise giving up. This is not good but does almost work.

sly pecan
sturdy sequoia
#

#1176509648707256370 message

sly pecan
#

There are some benchmarks there, but I didn't read it because LLM walls of text make me want to rip my eyes out

sturdy sequoia
#

Is it an LLM wall of text?

lofty turret
sly pecan
sturdy sequoia
#

Ok fair

#

But like all he shows in the benchmark seems to be amortisation of bigger function calls?

#

Is it something we didnโ€™t already know?

sly pecan
#

I think the main complaint here is memory usage. Not sure exactly what they claimed speedup is

sly pecan
#

Anyway I could swear there was an issue, but I guess it's just been discussed to death here on discord

lofty turret
left night
#

I didn't get the impression that it is written by LLM tbh

#

Either way, they did quite a bit of measurements, which I can appreciate

sly pecan
#

I'm like McCarthy but for LLMs instead of communism

left night
#

As some of you have also experienced, it's quite tricky to benchmark the Typst compiler because repeated runs have high variance in wall clock times (sometimes between 5-10%). This makes it really tricky to make any kind of statements about performance improvements or regressions as very few changes have such drastic impact.

This was really quite frustrating. So I looked at what the Rust project does to benchmark rustc and tried to replicate it by

  • renting a dedicated Hetzner server where we have 100% control over the OS
  • tweaking it to reduce noise as much as possible (no ASLR, no hyper-threading, core isolation, dylib interposition to override getrandom with a deterministic version); most of these tricks directly follow what Rust does
  • built some infrastructure to perform benchmarks and compare measurements (I've had some fun by building the comparisons via Typst-powered server-side rendering)

It's all still quite rough, but the results are very promising. Below, there are two runs of the exact same binary. As you can see clock time varies a bit, but the instruction count is really almost exactly the same. (Here, the difference is literally just 1. Typically it's a bit more, but not more than ~0.0005% variance.)

mortal bough
#

Is this something that will eventually be part of CI in some way?

left night
#

And if it should run on user PRs, I also need to make sure that it doesn't compromise the server

sturdy sequoia
# left night

Does it include export in different target? ๐Ÿ˜„

left night
sly pecan
#

@left night cooking something in comemo-prescient?

left night
# sly pecan <@311948531835469827> cooking something in comemo-prescient?

comemo-prescient is a testbed to check what the theoretical limit for optimizing comemo is. it runs the compiler twice, first recording the exact the hit/miss sequence and the second time, it already knows exactly how it's gonna play out and does not need to do any comemo tracking (which has significant overhead).

sly pecan
#

what do you mean by hit/miss sequence? As in what is actually reused versus not?

left night
#

which calls are a cache hit vs not. in reality, comemo has to do a bunch of stuff to validate whether there is a cache entry that is a hit. in this branch it already knows whether it will be a hit or not as soon as it enters a memoized function call.

on main, the validation can take quadratic time (that's the issue). I've implemented something that can validate it in linear time with a tree data structure. this makes incremental somewhat faster on average and much faster in the cases where the issue comes into play, but it slightly regressed some clean compiles and I was trying to figure out why.

sly pecan
#

That'e exciting. I think incremental compilation speed is more important than cold/clean, but regressions are obviously not ideal anyway

left night
#

interestingly it seems to be a slight regression in instruction count, but not in wall clock time

#

at least it appears that way. the thing is, wall clock time is noiser so it's a bit harder to say for sure

#

it could be that the instructions used in the one case are more expensive

#

but optimizing for instruction count is much nicer as it's super stable

sly pecan
#

This could differ from platform to platform too no?

#

based on what the bottleneck is

left night
#

it could probably

#

but optimizing for the benchmark server is good enough for me and the rest will behave reasonably similar I think

#

how much all of this caching stuff matters also highly depends on the document

#

if it's mostly text, it doesn't matter much; if it uses a ton of introspection, comemo overhead starts taking up a fair chunk of the compile time

sly pecan
#

As the package ecosystem grows and you have users blindly using a ton of them it's likely that even seemingly simple documents may have a lot of introspection

left night
#

yeah

sly pecan
#

I've seen some...creative solutions

left night
#

at least, with the changes I'm implementing, performance will not randomly degrade quadratically

#

if you have some contextual construct that appears a few hundred times in your document, that can already turn quadratic cache validation into a significant amount of the incremental compile time

sturdy sequoia
sturdy sequoia
left night
spiral current
#

(Not sure if this is the right place) We have Rust code that uses Typst as a library and we recently implemented generating multiple pdf files in a loop from the same Typst code, with a different JSON file input for each iteration. We noticed that each iteration compiles significantly slower than the previous one. It turns out that adding comemo::evict(0); after each iteration fixes this, so it looks like Typst slows down once the cache grows beyond a certain threshold. Do you think this is related to what the quadratic cache validation posted above, or some other issue?

left night
spiral current
#

Sounds good, thanks!

sturdy sequoia
#

@left night Congrats on the performance improvements, on masterproef with -j 8 I am seeing around 4% improvement on my MBP, and in -j 1 I am not seeing improvements. This is on cold, but it's cool to see. I'll try watch mode later because I am going to the gym in five minutes ๐Ÿ’ช lol

left night
#

I assume the 4% are from faster hashing

#

In clean compiles, that is. In watch mode with -j 1, it was 15% faster.

sly pecan
#

I'm guessing the exact numbers will depend on hardware

sturdy sequoia
#

I feel like masterproef is too text heavy to benefit from these things as much as other more recent documents. In a way masterproef was written when typst was a lot more basic than it is now and with those limitations in mind (remember the whole 27 GiB of ram to compile thing ๐Ÿคฃ)

#

So in a way I donโ€™t find it surprising.

#

To be fair many of the features I contributed and other improvements to performance were tailored for masterproef so maybe it has just become a document for which typst is very heavily optimised.

sturdy sequoia
#

@left night I did test watch mode and it does seem even 20% faster on masterproef! Really congratulations ๐Ÿš€ I can't wait to test it with the vm3d too, I need to finish updating it sweatwipe

sly pecan
#

Claims to be much faster than rayon

proud sandal
#

Probably not, I don't think that rayons overhead currently contributes enough to warrant a switch

wanton meadow
#

something that competes with rayon sounds very interesting in general

proud sandal
#

The benchmarks seem interesting, they also noted that they don't use container-specific implementations, which is why some container-specific rayon implementations are necessarily faster. So we'd only benefit from a switch in cases where rayon itself is the bottleneck, as I understand it. My understand is that, since layout can be fairly expensive, the rayon overhead should be negligible.

sly pecan
#

Maybe some things that aren't currently parallelized because the overhead would be too big?

#

I don't know. Just thought I'd mention it here

proud sandal
#

I think trying it out is definitely a good idea

sturdy sequoia
#

Trying it out is definitely worth it imo

#

But I don't think this'll make a huge difference

#

perhaps some of the Deferred<T> will be a bit faster with less overhead

#

but that's about it

sly pecan
sturdy sequoia
#

@silent I think that crown goes to @left night now with the comemo changes

#

nooooooo

#

I heck-ed it up, sorry Laurenz meant to silent ping

sly pecan
#

You woke the sleeping dragon

cunning wadi
#

Could perhaps be worth a shot

#

(Chromium and a few other well known projects use the C++ version of the same hash function)

buoyant lark
cunning wadi
sly pecan
cunning wadi
sly pecan
#

Doesn't typst use 128?

wanton meadow
#

it might be rustc-hash that should be the item to look at (for current typst git hasher) in the benchmarks from that github repo I guess

cunning wadi
#

And with the comemo thing mentioned in the PR, I'm not sure it uses 128 bit hashes anywhere now

cunning wadi
#

(Example: images)

buoyant lark
#

I guess laurmaedje is the only person that really has the tools to analyze the real-world performance impact of this currently and would have to be the one to make the choice ๐Ÿ˜…

looks interesting certainly!

sly pecan
#

Based on what I can tell (not a rust dev), comemo is still using 128 bit hashes

wanton meadow
#

the Rust stdlib interface for hashing only works with u64 final values anyway, so that's also what a Rust HashMap uses, a 64-bit hash. The internal state of the hash algo could be bigger of course

#

siphash's internal state is 128 bits (Rust's default hasher)

#

oh 128 bit is the key size the internal state is bigger

#

It's hashes in two different places, that explains the collision of concepts

pub struct CacheData<C, Out> {
    /// Maps from hashes to memoized results.
    entries: FxHashMap<u128, Vec<CacheEntry<C, Out>>>,
}

The u128 value is a hash. The hash will be (64-bit) hashed to find its slot as a key in the hash map. That's the two layers.

cunning wadi
wanton meadow
#

I don't know the details but if the u128 value is already a good hash value, wouldn't the trivial hash function (truncate 128-bit to 64-bits) be faster and so on?

#

that's for hash map performance only

left night
#

siphasher is for resistant hashes, rustc-hash for fast collision-is-ok hashes

sly pecan
#

I can't see that being automatic

left night
#

but regardless the rustc-hash is also used in a few of other places in comemo

cunning wadi
wanton meadow
#

we think that siphash has very nice properties, it should be able to be truncated as you want

molten kayak
left night
#

#1176509648707256370 message

#

the answer is: nobody did it

untold turret
#

needs some testing. nohash-hasher sometimes makes accesses to hashmap slower for bad distribution but worth a try.

wanton meadow
#

in this case the u128 is a siphash output so it should be at least as good as the hash was before (the change to rustc-hash)

left night
wanton meadow
#

working on it

wanton meadow
left night
#

ok, benchmarks are queued, will take ~10-20mins

wanton meadow
lunar kettle
#

is that benchmarking tool typst agnostic? could imagine that there are other uses cases where this could be useful ๐Ÿ˜„

#

say for example to benchmark a PDF renderer ๐Ÿ‘€

left night
#

currently, it is fairly Typst specific

#

but it could be extended I guess

#

the denoised setup is not Typst specific, just the infra around it

left night
wanton meadow
#

oh ok, that was quick response. Did we do something wrong? It also replaced all the FxHashMaps except for this one: edges: FxHashMap<(InnerId, u128), NodeId> because it's the one with more than just u128 as a key

left night
#

the diff looks correct to me

#

probably, it's just not a bottleneck anymore, at least for Typst

#

that raw gains in comemo could be different, though they're probably also not that large because comemo overhead is a considerable part of Typst compilation in some of the benchmark documents

wanton meadow
#

aha, now we know... you don't have any microbenchmarks in the comemo repo either? Then you can't be misled by them ๐Ÿง 

left night
#

no, there are no comemo-specific benchmarks

#

but it's also not used by anybody notable except Typst afaik

wanton meadow
#

if fxhash was a 2-5% win; i'd have a priori expected a similar or maybe 2% win with this too. I spot missing inline directives and the only thing that might have ruined the first try.
Is there a benchmark for typst or comemo I can run locally to verify a difference?

left night
#

Isn't #[inline] within a single crate not really needed anymore

#

But it's no problem, I can start another run if you'd like

wanton meadow
#

I'm not sure, because for example CallSequence is generic, isn't it monomorphized into typst somewhere, and then it still becomes cross crate(?)

wanton meadow
#

just running it seems to be the easiest way to know

sturdy sequoia
#

Well I mean, even then, using a passthrough hasher surely can't be a bad thing, right? It's not like it matters in the grand scheme of things?

#

(code/binary size wise I mean)

#

And I would expect the gains, small as they may be, to increase in watch mode

#

And I don't think the current benchmarking harness Laurenz has made does watch mode?

#

I wonder if a small one could be declared to avoid dependencies: one that just hash a u128 and turns it into XOR-ing the u64 pieces? ๐Ÿคทโ€โ™‚๏ธ

wanton meadow
#

the WIP one I did is just a file, it's like 15 lines of code and it truncates the u128 to 64-bits. So basically it's done like that already

#

would be more satisfying to be able to solve this one too FxHashMap<(InnerId, u128), NodeId> but it can't be a passthrough anyway so maybe it's not worth more custom code

#

(InnerId is a usize)

left night
#

the #[inline] annotation shows ~0.000% difference

#

but thin-lto is also enabled on release builds

wanton meadow
#

ok ok. thanks a lot for running it

#

thin lto seems to be enough to enable inlining anyway then (with codegen-units=1)

left night
#

fwiw, I am not per se against this change even if the gains are little

left night
#

also, the benchmarking harness uses only a small amount of documents

#

I'd kinda need a better collection

#

I was considering to ask people for them, but at the very least, it'd need some clarity on licensing

wanton meadow
#

as always would be good to understand what's going on. When I'm looking in typst/typst (git) there are zero matches for "benchmark". It would be nice if there was a hint where to look for existing benchmarks, or which testcase to use

mortal bough
#

Could you use random documents from users of typst.app?

sturdy sequoia
left night
sturdy sequoia
#

And would be wildly problematic imo

left night
mortal bough
wanton meadow
#

what's the current pointer towards benchmarking if we want to do some experiments, find our own documents to test?

sturdy sequoia
#

but definitely not opt-out

mortal bough
#

There is already a Telemetry setting

#

It could be made document-wise

left night
#

we can access document contents (both to support people and also because building a collaborative encrypted online document editor is actually super hard; I know of basically only one, by Proton), but do so only to support people and enforce the ToS

#

if it were encrypted to the point that we couldn't access it, we couldn't run benchmarks either, so I'm not sure I understand the idea

#

but anyway, I don't think it's necessary

#

just a few representative documents should be more than enough

sturdy sequoia
#

Necessary? No, I don't think so either

left night
#

I just want a public repo with a few test cases and with proper licenses on the documents

sturdy sequoia
#

I know y'all use at least masterproef in it hรฉhรฉ, still the document for which Typst is the most optimized ๐Ÿ˜ˆ

sturdy sequoia
left night
#

masterproef repo doesn't have a license either, so I could not commit it and am probably already in violation for using it :(

sturdy sequoia
#

I need to upload one

#

or re-upload one

#

As far as I know the content is under some Creative Commons license by the uni

left night
#

and unfortunately I don't really have the time currently to build it into a proper system usable by anyone

wanton meadow
#

I was thinking it doesn't need to be as good as your main benchmarking setup. But if the repo of documents doesn't even exist publically, then that's how it is now

left night
#

4/5 are public basically

#

or rather, can be

wanton meadow
#

a benchmark corpus should preferably be stable so that you can compare versions. Then you need to port a little bit if there are breaking changes

sturdy sequoia
#

Ok @left night crazy idea: spin up a QEMU emulated VM, and count instructions of just typst, that way it's a "pure" instruction count? Doesn't take into account the effect of memory layout - that we know are important, see Content Rework 2 - but at least it should give a rough idea. ๐Ÿ˜‚ (it's horrible I know)

sturdy sequoia
left night
#

Then masterproef

#

And then one math-heavy document that I cannot share

#

That's it currently

left night
sturdy sequoia
#

I have some slides I could share, I'd need to ask for permission but they have been presented outside of the company quite a few times before at events/conferences so it should be fine. (I'll ask on Thursday). They're nice and slow because of introspection catwow

sly pecan
sly pecan
left night
sly pecan
#

The difference in performance seemed very substantial, but I guess it's very situational based on what the limiting factor is

thick solstice
#

How powerful is the benchmarking server?

sturdy sequoia
sturdy sequoia
sly pecan
sly pecan
sturdy sequoia
left night
#

I just want to be able to compile it and commit it to a repo

left night
#

I think it could only be applied to a very small number of structs in Typst

sly pecan
left night
#

Currently it's not worth the effort but at some point we could have separate wasm blobs for newer and older browsers.

sly pecan
#

I don't think a single browser supports it yet anyway

sly pecan
rigid latch
#

Oh, speeding up the equate package would be so great ๐Ÿ˜•

sly pecan
#

How much time is spent on shaping in a typical document? Asking since harfrust is supposedly like 3 times faster than rustybuzz

left night
#

Because I have literally today gotten an email from Behdad where he was inquiring about how much shaping is a bottleneck in Typst.

#

Copying a part of my response here:

  • On a long document that consists only of Lorem ipsum, just the shaping with rustybuzz (excluding shape plan creation) made up ~25% of total compile time (including PDF export which is about 1/3 of compilation there). For previews, PDF export is skipped, so there it would be a little more. This was with naive first-fit line breaking. With optimized line breaks, it would become less, at least as long as the line boundaries are safe to break.
  • On a document that uses more of Typst's other features, shaping quickly becomes a much smaller factor. On one thesis document I tried, it was only 3% of compile time (still with rustybuzz).
sly pecan
#

That should still be a decent performance boost for text heavy documents. Maybe also memory usage if that's improved?

#

But not game changing

sturdy sequoia
left night
sturdy sequoia
sly pecan
#

Harfrust is now only 16% slower than Harfbuzz based on the geometric mean of the latest update in their spreadsheet

#

Compared to rustybuzz which is like 3-4 times slower

sly pecan
minor kelp
#

Was there a behchmarking script somewhere or am I misremembering?

buoyant lark
#

IIRC the benchmarking stuff has more to do with steps towards consistent results than specialty stuff around Typst itself: #1176509648707256370 message. I don't think anyone outside of the team currently has that set up, but in the past they've helped get various patches tested #1176509648707256370 message

for less critical stuff / initial testing you can probably follow standard practices.

minor kelp
#

The first link is probably what I had in mind. Well at least now I know what it was about, thanks

buoyant lark
#

I mean there is also this #1176509648707256370 message (right below you can find some documents that are used for it), which I guess fits "benchmarking script" better, but its just a hacky interface to the former ๐Ÿ˜…
There are a couple of things users have tried in this thread but I dont think those have been really validated yet.

Wishing u the best with your shenanigans

spiral current
#

Just rechecked our use case (#1176509648707256370 message) with Typst 0.14 and Comemo 0.5, unfortunately the issue still exists so we still have to evict comemo cache after each iteration. There is noticable increase in memory usage in both debug and release builds, and in debug builds the pdf generation still gets noticably slower after a few dozen iterations.

left night
spiral current
#

Yes I can create one (but probably not very soon)