#Performance
1 messages ยท Page 8 of 1
a global list is probably the only way
yep
the question is, when the user writes a string that is in the global list, how do we check that it's in the global list?
by hash I assume?
my beloved
though rustc I think just dumps it into the normal interner hash map in a Lazy
For performance, I also wonder how much adding a "flatten" pass to the AST would bring
it's tricky with closures
they keep around a strong ref to part of the AST and might outlive the source file (in the comemo cache)
Ah, I see
What would you be thinking for that?
I've thought about ways to impl flattening just a little and haven't felt confident in anything
using IDs instead of storing nodes
and having all the nodesw in a bi array
lol. Almost had me getting flashbacks to implementing heaps for algo class
could you imagine, a binary tree for ASTs
that sounds kind of cursed
It would either have godlike performance or completely destroy your cache hit rate and I'm not sure which it would be
One of the two for sure ๐
@sturdy sequoia I've come up with a fairly nice approach for compile-time constant PicoStr. :)
The idea is as follows:
- We increase the size of PicoStr to 64-bit (PicoStr = NonZeroU64)
- We encode strings of length <=12 with the charset
abcdefghijklmnopqrstuvwxyz-1234directly in the number - Longer strings can be compile-time interned by adding them to an exception list
- Runtime interning is also possible, as before
Compared to just listing all strings, the benefit is that most strings can just be written directly without touching the exception list. The small downside is 64-bit instead of 32-bit.
Here is an initial gist with an implementation. All of the decoding and exception handling is 100% const fn, no macros!
Notes:
- I tried various exception lookup mechanism, but perfect hashing in const fn was a little much, so I opted for a simpler binary search.
- I also tried more efficient number encodings (Huffman coding, Arithmetic coding), but they didn't really give much of a benefit over a simple 5-bit encoding, so I skipped that as well.


@left night Do you have an idea on how many built-in vars/fields/etc. that covers?
I would assume most of them?
How do you distinguish between interned strings and exceptions?
index < exceptions.len() => exception
index > exceptions.len() => subtract exceptions.len() and use
I see, the 4 top bits
Have you tried swapping the current impl with it and see if it degrades perf? (beyond any additional addition of strings)
I haven't, but I doubt it would affect anything much, since PicoStr is only used for Labels
Which are somewhat hot, but 32-bit vs 64-bit shouldn't that big of a deal
BTW for perfect hashing, there is a crate to generate it for you
But it doesn't work in const fn
Do you want me to try (after work ~1 hour) to see how it impacts fields too?
Probably best if I first make a PR, so that we don't need to integrate it twice
And then you can do some experiments :)
Though I guess it's mostly copy-paste
I didn't really write in Typst, it's a playground crate
My main motivation, tbh, is to be able to write struct HtmlTag(PicoStr) and have HtmlTag::H1 as a const ๐
In the meantime, I am working on my own cursed creations:
what the heck
๐
You have no idea how cursed code I am doing for work is
but it's fast ๐
It's for C++ <-> rust interop
It's quite limited, but super nice
If you're willing to walk the mile, you can do a ton of stuff
Yes, you can even do some basic hashing, which I am planning on using in my project to test if I need to regenerate files, etc.
Kind of like comemo but between invokation of my own cargo subcommand
ah interesting
I did write a mini-hashmap in earlier iteration of pico.rs (~50 lines of code), but I think it would have been slower than binary search
because it wasn't perfect hashing and linear probing could've taken a bit
it's true that it's linear probing of pointers :/
not of the direct values
(did someone say inline byte array with \0 to delimit strings?)
can build that in a const fn from the array!
the interesting part to me is that you can kind of by dynamic, but you need to do it in two steps
you can build an array of dynamic size by first computing its size (one const fn to sum the strings length + 0 bytes) and then another one that returns a [u8; THAT_LEN] which actually builds it.
it's interesting how you need to split it
you can do some nifty stuff indeed, not as powerful as zig unfortunately
yes, but more powerful than I thought previously
at work my solution is the following:
- have proc macros that hide functions in the user's final library
- generate (using rustc) a tiny executable that calls all of those functions and write to a file the serialized output
- read those outputs
- generate c++ code
- bundle the rust library with the c++ code
- ???
- Profit!
It allows me to get information you normally would not like the size, alignment, etc. of structs
It also fully works with feature flags, etc. ๐
kinda like wasm-bindgen I think
Probably, I took inspiration from a crate for building PostgreSQL plugins
Before I used a heap of build script and it was... messy
FWIW the newer rust versions are getting lots of new const functions
mostly cuz &mut will work in const in the next rust version
iirc
and together with the new (already released) const { } syntax, const stuff in Rust is gonna go boom
So good
I'm curious what people will be able to do with it
Next, I want proc_macro_span to be stabilized ๐ญ
With fields we need to be somewhat careful. We don't want to accidentally intern a ton of dynamic field accesses but we want fast comparison.
Yes of course
How would you approach it?
I donโt know yet 
PR is up: https://github.com/typst/typst/pull/5491
We should probably fuzz that btw
Don't question my code, it's flawless ๐
me a few days ago after i got called for the production segfault
I wonโt do it again ๐
jk, it segfaulted in tests
We do have that fuzzing harness
So we can just let google do the fuzzing
isn't it lovely? that's all I wanted.
beautiful number comparisons at runtime
I guess it could've been an enum, but this is so much nicer
btw, did you guys know that html attributes may contain whitespace like the paragraph separator?
That's indeed super cool ๐ฎ
what do you mean?
like class = "Hello, World"?
I mean my-attr="hello", but with the - being replaced by U+2029 PARAGRAPH SEPARATOR
:(
BTW for PicoStr, do you think we could use a faster hashing function?
would make interning a tiny bit faster
Also, using a const-constructable HashMap could avoid having the LazyLock which would likely make it faster too
yes, both things make sense
though I'm not sure how much dependency cost I'd be willing to pay to get them
I mean for the hashing, I think we already have some faster ones in it, don't we?
And for the HashMap I'd have to look, I don't know what's the state of const hashmaps
actually std hash map is const if you provide your own hash builder!
it's probably just not const because of RandomState
But for this, we don't care about ahving a random state?
I don't think this is DDoS sensitive
Maybe I'm wrong
agreed
So it might be relatively easy overall
if you want to ddos yourself, go ahead and write range(10000000)
I know and I don't care ๐
cause DoS sounds dumb
You'd probably like it if it was D-o-S.
@left night couldn't PicoStr contain an inline string or a static pointer to a string?
that way resolving is faster?
using the LSBs to differentiate
since it's 8 bytes now anyway
Make the eq impl between &str and PicoStr trivial
No need to lock when resolving a pointer that way
The static pointer cannot be known at compile time
Preventing const fn
Maybe we can only use indices for the exceptions and a raw pointer for runtime. Though LSB doesn't directly work because of string's alignment being 1.
Endgame is building all conceivable documents at compile time. Truly instant preview ๐ง
ah shoot
duh!
We can even compute the derivative of an arbitrary function at compile time now
This macro handles automatic differentiation. Automatic Differentiation macro which allows generating a new function to compute the derivative of a given function. It may only be applied to a function. The expected usage syntax is #[autodiff(NAME, MODE, INPUT_ACTIVITIES, OUTPUT_ACTIVITY)] where: NAME is a string that represents a valid functio...
If this shit ever ships to nightly, so it can actually be used without rebuilding rust, then and only then, will I entertain it's existence.
but it is in nightly
isn't it?
There's even a crate using it
There are still PRs open.
VM comeback?

Not sure how applicable it really is to Typst because a large part of Typst execution time is in the "runtime" itself and not in actually running specific instructions. But it certainly looks good, even a bit too good to be true.
Dherse nightmare / dream
I think JIT makes sense if the API were massive, but as it is, a user only imports what is needed and so it probably most codepaths will be run. The only exceptions are those behemoth packages like cetz, and it would probably massively benefit them, though there are easier ways of handling that currently (splitting up packages as much as possible)
I think, in general, that packages like cetz will already benefit from the VM itself
huh i recognize the authors
they also wrote the infamous copy&patch compilation paper
it seems this paper builds upon that one
TL;DR?
they use so-called "stencils" to stitch together code snippets from a vast library of binary snippets which during code generation gets their missing values filled in
and they provide a system for generating these binary stencils
ah cool, that's what the new python jit uses
I learned that IDs can be plain numbers recently, but if you want to style them from css you need to escape the first character ๐
Found this neat paper about how they tricked out a tree walking interpreter for R with a bunch of tree optimizations and beat the standard bytecode vm
https://dl.acm.org/doi/10.1145/2576195.2576205
dherse in shambles rn
impressed more than anything
@sturdy sequoia I did a small performance experiment and I thought you might be interested in the result.
What I tested:
- Normal Typst on a few small to medium sized documents.
- A customized version where the global allocator is just a bump allocator. Essentially trading off memory for the theoretical optimum in terms of allocation (and especially deallocation) overhead.
- A customized version of Typst with Mimalloc.
Interestingly, the bump allocator was only ever so slightly faster than the normal one, but Mimalloc was a fair bit faster than the bump allocator!
I can see two potential reasons for that:
- Mimalloc has support for realloc, which is probably quite relevant for vectors, or really any growable collection
- IIRC Mimalloc has thread-local free lists, whereas the bump allocator was one global atomic, so it might have more sync overhead
I do wonder which (if any) of the above is the primary reason. Either way, very impressive by Mimalloc.
I had tested mimalloc previously and it generally provides a nice speedboost, I would assume that it's because it's also nicely made for parallel applications (iirc). Realloc does make sense too but I don't know how much realloc-inc rust does by default
I remember long ago I could get a 15% uplift just with Mimalloc, for a long while I ran locally a custom Typst with MiMalloc lol
I would assume that all std containers use realloc.
Until a few weeks ago, I never realized that doubling a big vector in size doesn't even necessarily need a large memcpy: Even if the memory is fragmented, the allocator could just remap the virtual memory! That's something that fundamentally distinguishes native from WebAssembly.
Why would you call your allocator memealloc?
because everything Microsoft does is a meme
Mind you, for once, Microsoft did something right
MiMalloc is awesome
Ugh, that's true but not necessarily obvious
although I would think that in the std code, they don't handle the realloc themselves
does memcpy check if the src and dst addresses are the same?
I'm not sure, but why would it matter? If you realloc, the contents get copied over by the allocator after all.
what do you mean?
In the code of the collections themselves (outside of the allocator)
oh, it does? ๐ฎ
Then yeah it doesn't matter indeed
The collections just do alloc::alloc::realloc
damn
so they do
fascinating
and I suppose mimalloc is probably also tuned for good realloc perf
maybe it leaves some gaps and fills them in later
idk. I did skim the mimalloc paper recently, but I don't remember anything to that effect.
The collections doing realloc makes sense, since the allocator doesn't know when an allocation is to be reallocated vs newly allocated, but the collection does
I do remember that it allocates in size classes, which kind of goes against reallocing in place
When I wrote my own collections in C, I manually called free and malloc/calloc
I am dirty like this
back then I didn't know about realloc
You slut /s
why u hurting ur ram like this?
I have a specific episode of Rick and Morty that comes to mind everytime I hear/read that word
Sorry, I didn't know better ๐ญ
don't worry, Typst has already hurt your RAM worse than your own collections ever did
Iirc correctly, typst used > 28 GB of RAM to compile my thesis back when it used the first iteration of my glossary
which BTW, all glossaries on Typst Universe are descendants of that glossary
I was ahead of the curve ๐
Honestly, RAM usage is fine, as long as you have tons of it ๐
You just need to avoid using cetz in presentations ๐ญ
RAM usage (although I've not seen recent benchmarks) is one pretty big weakpoint.
Couldnt reduced memory usage conceivably even improve performance because of memory bandwidth?
Depending how it's done of course
It most likely would indeed
If nothing else but also because there would be less stuff to hash, less stress on the allocator, fewer entries in the comemo caches, etc.
I agree, but memoization is what makes Typst so incredibly fast, and I don't see an immediate way of improving memory usage
I think overall we could allocate less
And there are optimizations I have previously suggested that we could use in comemo
(mostly disabling memoization when the call stack is deep enough)
But these are tricky things
Also, I am sure we could also optimize a lot of data structures to be less memory intensive
but the problem there is that profiling allocations is tricky in my experience, I haven't found a good way of doing it yet
That's why I want to try the one on macOS, maybe it can tell us more
Or a custom version of the CLI tool that uses a tracing allocator of some kind
I see. Man, it really is intricate stuff. I wish anyone working on this nothing but good luck, and fortune.. Hopefully all your attempts to optimise yield results..
a single BlockElem has std::mem::size_of ~1KB. and we have a ton of them. I absolutely hate it, but it's not trivial to make elements less heavy.
I think most (all) fields shouldย probably be behind a pointer. Then we also get the None case for absence for free
but I have no clue how much these huge element sizes and all the cloning of them really affect performance
hard to know without fixing it in the first place
That's perhaps the one bad consequence of the Content rework
It was obviously worth it overall but it did balloon the memory usage ๐ฆ
yeah
what about make block share data by shallow copying each others? I was impressed by scala using it to allow my compiler to edit ast cheaply.
most copies are already quite shallow through lots of Arc
or do you mean something else?
sounds like most fields of block comes from style chain (probably). it is natural to copy edit the element from some value in the context.
not exact, arc copies on change, but shallow copy creates new instance by referencing some existing value and attaching changes. scala may have some technique to not slow down reads with using shallow copy. I didn't learn much about the underlying implementation.
I see
Ok I might have a bit of a cursed way of solving it
that would likely work
Essentially, have two repr: dense, and full:
- dense: only contains fields marked as
#[dense], they are always present - full: all other fields would only be included in this repr if they are set
That way, any elements that doesn't have all of its fields set will use thedenserepr, using less RAM, if they have more fields set, it uses thefullrepr. We can trivially turn one into the other if needed, and we can use a single bit somewhere inContentto do the switch
That actually sounds pretty doable, curious how much it would improve performance
On the dense impl, accessing one of the field always accesses the style chain
If the element doesn't have any #[dense] fields, then it is always full
required fields could also be the selector for dense or not
Since everything uses dyn Trait under the hood, the Construct impl can return either one and nobody will ever know
Of course, that would mean that places where we create BlockElem (for example) would need to either use a BlockElem or a DenseBlockElem based on what they need
Then for the impl for T where T can be dense, we can have a dense proc-macro that duplicates the impl with the second version
(just copy pasting the code, swapping T for DenseT)
For example in raw.rs, this:
realized = BlockElem::new()
.with_body(Some(BlockBody::Content(realized)))
.pack()
.spanned(self.span());
Would become:
realized = DenseBlockElem::new()
.with_body(Some(BlockBody::Content(realized)))
.pack()
.spanned(self.span());
As far as all the field enums, etc. they can just be shared
I would not use this for all elements, but for the really big ones that are constructed often (block, box, heading, etc.) that often use the style chain. By doing some experiments, we can determine which fields are good to be #[dense] and which ones are not worth it.
Still a bit of manual work, but mostly automated!
Duplicating elements sound like a bit of a pain to deal with and also like a compile time hit. Generally speaking, I'd also like to reduce the amount of proc macro code generation rather than increase it.
For user-defined types it seems more feasible.
There's also the problem that showing an element materializes all fields
So it's always fully populated then
From what I can tell, the big ones are block, box, heading, grid, table, pad, place, bibliography (although there's usually only one so not worth it imo), enum, figure, figure.caption, footnote.entry, list, outline (but usually only one), outline.entry (not sure it would be worth it), underline et al. (they're huge), curve, line, path, polygon, rect et al., image
It does? ๐ฎ
Or by showing you mean in a show rule?
Perhaps just using the same technique as par by using more ghost fields on the big ones?
So that you can access the fields in the show rule
The semantic pars will get rid of the ghost fields

Generally ghost fields mean that the fields will not be user accessible in a show rule
That's not good
I mean they're still used in quite a few other elements
I kind of liked the idea of dense and full repr
I think it is pretty clever nonetheless :-p
I think we can potentially repurpose it, though in a bit different way
I was writing something about if we had another variant that held a point to an existing full instance and then contained a list of local changes to fields until I realized that that was effectively what the style chain is already ๐
Where we always only have the required fields plus a local dynamic container and that dynamic container is an immutable data structure shared with the style chain
Style chain would become owned and pre-fold instead of folding lazily
That way materialization is typically just an Arc bump
and we can potentially get rid of the accessors depending on the style chain
Ah I see, and when you set block(spacing: 10pt) on the style chain, it updates the structure in the style chain?
Then when we realize, we just store an Arc to the style chain's copy?
What does "pre-fold" mean?
So if I do:
set block(spacing: 10pt)
........ (imagine more code here)
set block(stroke: 1pt + black)
I would have two structures on the style chain:
- one with just the spacing
- a second one (deeper in the chain) with both the spacing and the stroke?
more or less. basically it's just a dynamic set of properties and uses the same data structure as the style chain
And we fold eagerly on the style chian?
That actually sounds pretty swell
the style chain as is I believe optimized for the wrong thing
it's cheap to amend but fairly expensive to access
but access is probably more common
When accessing is done a lot, but amending is cheap
we should have something that is cheap to access and reasonable to amend
the #[fold] thing were 1pt and red merge into 1pt + red
currently it's done lazily and that means we can't access such fields via borrow
some of them are pretty big so it's unfortunate
So pre-fold merges them eagerly when found?
I think that should be fairly doable
I think we can explore these ideas alongside types
For custom types it is actually fairly natural too then
That will need a big internal rework anyway and it gives us an opportunity to start from scratch with the macros and stuff
To be fair, that would lead to the removal of #[ghost] fields, massive simplification of the #[elem] macro and a simplification of the StyleChain
I dig it
I'm curious how quickly I could whip up something
Unfortunately won't have any time until Sunday
On the proc macros, I was curious if it might make sense to do the code generation ahead-of-time? I think Rust Analyzer does this with their AST where they use a DSL and a separate build step in Cargo generates struct definitions. I was thinking that if the DSL for defining elements were to match the new custom type definition syntax for Typst, it could be a way to merge the two, and maybe even write the "default show rules" for some elements in Typst itself. Will likely have downsides in rust tooling experience though
Would require either a complex build script or a custom cargo CLI
not sure how much of a gain you'd have there
possible? yes, a heck of a lot of work? also yes
I am writing a CLI at work to do Rust <-> C++ interop and it's a 15k LOC beast
JUST FOR THE INTEROP, the CLI on its own is like 5k LOC too
Oh no
Yeah, a separate build step would be implied. It's an interesting design space to think in though
A CLI to generate the interop code? Wild
what would be the benefit of doing it that way instead of via a proc macro?
Allows for generating code in dependencies ๐
So essentially, a Styles would look like:
struct Styles {
set_rules: HashMap<Element, Arc<dyn Styled>>
}
Each time you create a new node, you clone the hashmap (would probably not use a hashmap for this but whathever, bear with me)
And each time you modify an element, you either: insert a new entry into the map, or use the existing entry and fold everything
Of course in partice, you'd still want a style chain to avoid cloning the whole style chain every time
Then to find for element E, you just look up the first instance of an E in the style chain, giving you its fully folded style at that point
what's Styled here?
a trait that makes one of these immutable structures in the chain
but since it's in the chain, it has to be in a dyn T
could be a dyn Any, just something that the element can downcast back to a concrete type
so one field value basically?
Styled: Send + Sync + DynClone or something like that
I would make it a struct of all of the element's non-required fields
ah interesting, I didn't get that
do you think that's better than one per field?
I mean it does have the benefit that we can clone just the one
that wasn't what I was proposing, but very interesting
And then each element would look like:
struct BlockElem {
// The non-required fields in an optional pointer
extra: Option<Arc<BlockElemNonRequiredField>>,
// The required fields
body: Content,
}
I would think so, memory usage should still be lower than it currently is, and it makes lookup easier and faster
also less fragmented memory when you're actually using the elem (i.e in the Show impl, etc.)
since then it's all behind a single pointer instead of being behind n pointer where n is the number of fields
so much more cache friendly
More memory usage, but more efficient for the CPU
Also easier to impleement and understand imo
And we can still always borrow from the style chain!
So less cloning = more vroom vroom
I really want to work on that now ๐
More vroom vroom = happy me ๐
I would also not store the style chain data as Values then, but as the "cast" type
That way you get less Value conversions overall
(also required for folding tbh)
I think that in itself would already save a lot of memory
And then I would store the revokation, and show rules separately
using two arrays, just to make things a bit easier, since they're essentially used in separate steps, I think it also makes sense
more allocations, but we reduce so many other allocations that it would be worth it
Oh, the style chain still uses Block so it's already the cast value, my bad, I misremembered
For liftable, those could be handled essentially the same, just adding a bitset to each elem with which properties are liftable I guess, a bit weirder I'd need to think about it some more
@left night What do you think of the per-element version? Or what upsides do you see for the per-field one (outside of memory usage)
For default values BTW, I would make the BlockElemNonRequiredField contain all Option<T> (where T is the field's type) and then get the default value from stored globals when it is None, also nicely allows detecting what was set and what was not
I think it definitely sounds like a viable option as well
For the elements liftable shouldn't matter
Only in the style chain. But yeah I guess that's what you meant.
I thought it was per-field, is it not?
hence why I said a bitset, could even be baked into the BlockElemNonRequiredField for which ones are liftable
that way it can use the minimum size needed (i.e u8 or u16, etc.)
That honestly all sounds like a pretty good idea
Behind the Arc
It does sound promising indeed
Devil is probably once again in the details though
but who knows, maybe not
I'd be curious to see the impact on memory usage & performance
I'll definitely see if I can find time to try it (grossly, not optimized, likely with bugs) and see whether it provides any gains
yeah just a quick test to check whether it's even worth it would be cool
Alright! 24 hours has passed! Did you do it @sturdy sequoia?
I wasn't even home yesterday
wtf lol
It's a risky jokey comment. Funnily enough, your response leads to a Kevin Smith reference
@left night do we need to hash the liftable?
(i.e do I need a manual impl of Hash on the style chain structs, or is a derived one fine)
I think they shouldn't but I am not sure
I think yes. Any reason why we wouldn't?
My main thinking was that since the styling structure is cloned inside of the element on realization (using arc clones), the lift able field will be included and might lead to some failure at memorising?
Once I get there I can test both anyway
Same with using a linked list of styles or always owned
But from the elem macro, most of it is implemented already, but I am at an event so I wonโt get to continue tonight, maybe tomorrow after work, donโt think Iโve got any late meetings ๐ค
I see, you mean only in the element itself
Tbh, for an initial perf investigation you can just ignore liftable entirely
Sure, but I am trying to think longer term too, might become Content Rework 3D
For the element it shouldn't be part of the hash then I think
I mean I can do a manual implant for the element
For now I would really just not worry about it
What is it used for?
is it for follow up set page(...)?
it's for page wide styles
What are page wide styles?
btw, you can now have --timings on the web app!
(currently, only available in the dev compiler version, but will also be available in 0.13+)
YOU CAN HAVE WHAT?
That's awesome! ๐
Those timings ended up being useful ๐
That's actually awesome!
Just tried it ๐
would be nice if there is some kind of success message after restarting the compiler
Also shouldn't a restart trigger a recompile? ๐ค
ah nvm, I think the compilation is just so fast I can't see it ๐
I considered removing the preview upon restart, but wasn't bothered enough
@left night would be nice once you're done recording if it gave you a link to perfetto for viewing it
yeah, same as above I wasn't bothered enough ๐
it would also be nice if some more of the web app-side stuff had #[time] annotations
right now there's some blanks in there
but the "hard" part (adjusting typst-timing) is done now
What was needed to change?
I guess making it WASM compatible?
Wow, it's a much bigger deal than I expected ๐ฎ
(lookin at the PR rn)
Makes sense, super cool stuff!
it's fun that you can generate a thread id by combination of a thread_local and an atomic static
even more fun is that wasm-bindgen uses a very similar approach to make thread locals possible in the first place
I guess it isnโt necessary, but it does at least let the user know that it actually restarted :p
Cool stuff though!
comemo can be pretty ineffective if there are more than one active "worlds" at the same time. I observed this fact in a crash when the language server was hot reloading fonts today.
The crash site is here. A typst function f uses the 472th font in some world. Then, another thread runs the function with another world which has only 17 fonts. It crashes intermediately on the self.fonts[index], complaining index out of bounds: the len is 17 but the index is 472.
fn font(&self, index: usize) -> Option<Font> {
self.fonts[index].get()
}
Beyond the crash, when executing a same function call, the multiple threads using different worlds will atomically read and write a same shared comemo entry. Is it kind of ineffective?
yeah, that's a bit dumb. we should just do a checked .get(index)? there
not much we can do about multiple threads racing on the same call
if they have at least one tracked argument, we can't know ahead of time whether it's actually the same call
I did that, may be I can also make a PR to typst.
constraint validation may also a cause of slowdown with multiple worlds, but the discussions in #1325795747165114479 outline ways to improve that
#1325795747165114479 is probably a better place to discuss this btw
I probably found an interesting case to optimize. @sturdy sequoia See this function: https://github.com/touying-typ/touying/blob/3d8819c8e13fc9fc80620da26dc7ec48970c44f0/src/utils.typ#L92-L108
The function abused by touying will at least double the cost of memory in regard to comemo caching functions unconditionally. A so-called small function optimization is possible:
- Only simple native functions are called in the function.
- Only a few of simple
if/for/import statementare executed. Aforis simple if the loop times is know statically and small. - The function is not recursive.
Then we could invoke the function without caching.
I remeber you had surveyed conditional comemo, so you might be interested in it.
Touying is a powerful package for creating presentation slides in Typst. - touying-typ/touying
I have the same issues in many places in codly, so I did not use a function for it (to avoid comemo)
I generally think there should be a content.labelled(xxx) method @left night
But I think in general small function optimization would make sense
but it's hard to "detect" AFAIK
I don't want to add new methods on content because they will need to be deprecated with content = value
Maybe we could have a very conservative detection of small functions that at least works in some cases.
#let smol f(x)=
I think C should adopt smol as an alias for inline ๐
Though it wouldn't actually work well in their syntax
Cause smol int f() vs short int f() would have very different meanings
Prefix return type were a mistake
Ok but low key
I like it
Can we have this pwease pwetty pwease @left night
I need it SO much in codly
and clearly in touying too
much better than annotations
I think coming up with a conservative complexity metric might not even be that hard
the real time system i work on at work simply counts the steps it requires to execute a function and if it's less than n it executes it regularly
and the steps are just the AST nodes there
well, sans loops and such
Can we assign every operation a complexity score, with the score of a function being the sum of those?
Maybe too naive
Something like that, but loops will need special treatment
though for a map it's not unlikely that it's just three items
in this case, it's really hard to tell
interestingly, if built-in functions and methods were memoized, it would be a moot point. but it would make the situation even worse at the same time.
yes my point is more that you have to consider them as loops too
Actually I have a good metric @left night we measure the hashing time vs the run time, if the hashing time is lower, we use memoization otherwise we don't
Obviously can't use the sys call for time but in native we can use the fasttime crate of whathever it's called (crate to get time using CPU shennanigans, forgot the exact name)
But I guess that small function detection doesn't work this way
because it's input dependent
the only solution is indeed to do it from the AST nodes
ugh
Ok, so I am experimenting (no promises on an actual PR), and I did the thing on AST nodes (giving each node a cost, kind of random based on my gut feeling
), I also am using my comemo branch with enabled for comemo::memoize and the results, at least on masterproef are disappointing:
- It turns out that straight up disabling memoization on closures saves (on masterproef) 60 MiB of RAM (i know i know calm your applause)
- Adding the cost thing makes basically no difference at any of the random "thresholds" I have set to compile time but it does reduce memory by a similar amount as outright disabling memoization. So perhaps that's an argument (just memory)
This reinforces that eval really isn't the bottleneck in real world documents, which I think is a theme that arkons back to early days of typst optimization.
Now there may be some real-world eval-heavy documents, but I'd need one of those to actually do the work of determining whether this is worth it and finding "better" values than the ones I have chose. As an aside, if this was part of my job, I'd actually go through the trouble of doing some kind of iterative process to find the gradient on those values and descend to an (likely local) optimum, but my brain is too tired to be doing that.
So does anybody have good test docs for me? @untold turret @proud sandal @sly pecan @left night @lunar kettle?
(sowwy not sowwy for the pinwgs)
Don't have any large eval heavy documents i think
Make sure to try elembic's integration tests
There's one with like 5000 custom set rules
Good idea
test/integration/many-rules.typ
I just actually ran a full instrumentation of typst, and like most of the time is spent in the grid layouter, I guess that my thesis has a lot of tables that span pages and that's bad
there's probably stuff to improve there in the future
maybe you could share your results? that could be helpful
Someone posted a large gallery of cetz drawings. That's an option
Or the cetz documentation itself
cetz-plot also has a particularly slow test
and the tests (so far) don't need tytanic to run, they don't use any of the test library features so running typst should just work
Unfortunately they're with the macOS built-in profiler so I don't know for sure how to extract it
But I love Instruments, it works so well
I mean you can just send the raw thing and i can figure it out
Unless you mean you don't know how to send the raw thing
I guess screenshots also work in a worst case scenario lol
Exactly ๐
Still haven't found a way of getting a flamegraph
I kind of narrowed it down to being codly blocks in tables that are especially gnarly
There is a fresh "cursed" single-file document. https://raw.githubusercontent.com/jthulhu/grothendieck-fibration/049d4766a3dc9a6c662ec0b884c26ee91be7f565/notes.typ
It turns out that straight up disabling memoization on closures saves (on masterproef) 60 MiB of RAM (i know i know calm your applause)
Commented out the line on closures (on the "cursed" doc) then cpu goes from 6 secs to 13 secs, mem goes from 1.92GB to 506MB:
/// Call the function in the context with the arguments.
- #[comemo::memoize]
#[allow(clippy::too_many_arguments)]
pub fn eval_closure(
More info: after disabling cache on some functions, the cpu time goes from 6 secs to 5.9 secs, and max memory usage goes from 1.92GB to 905MB.
let file_id = func_site.id();
if file_id.is_some_and(|file_id| {
let file_name = file_id.vpath().as_rooted_path().to_str().unwrap();
file_name.ends_with("utils.typ")
|| file_name.ends_with("util.typ")
|| file_name.ends_with("complex.typ")
|| file_name.ends_with("bezier.typ")
|| file_name.ends_with("vector.typ")
|| file_name.ends_with("matrix.typ")
|| file_name.ends_with("coordinate.typ")
|| file_name.ends_with("grouping.typ")
}) {
return eval_closure_inner(..);
}
eval_closure_cached(..)
That's actually super useful!
Because I can use this to figure out a good score treshold!!
BTW, I also found tons of small optimizations throughout the code base ๐
They save basically zero time
but they do improve it by like ~2% repeatably ๐คทโโ๏ธ
With the current settings I have it's 6.95s at 494MB
(on my MPB so i need to check the "native" case)
Compares to 4.8s at 2 GB for main
branch time mem
main 4.77 2 GiB
score (1000) 11.08 390 MiB
score (200) 7.22 500 MiB
score (100) 5.88 650 MiB
score (50) 5.71 700 MiB
score (40) 5.44 740 MiB
score (30) 5.48 740 MiB
score (20) 5.10 830 MiB
score (10) 4.60 1.1 GiB
score (5) 4.68 1.2 GiB
(memory obtained from `typst watch ...` time via `hyperfine typst ...`)
(ran on my M4 Max MBP, lots of other stuff running)
Clearly need to optimize scores
So clearly the score approach saves TONS of memory
I have the inkling that allowing the smol function definition would still be better but heh
Ok, I ran a proper sweep
So you can see that we can save time & memory for small functions
I'll now update it with hyperfine but it'll take a while
what about the performance on incr compile? I think it should affect a bit if we perform SFO (small func opt) on functions which have T_cache < T_exec, where T_cache is the time to find and return a cached result and T_exec is the time to execute the function.
Here it's cold compile only
Don't have time to do incremental right now (I am literally in the office working lol)
Updated with time from hyperfine
kind of like me. I oftenly play a joke saying that I made my useful tools (tinymist etc.) when I'm compiling my c++ and rust projects in my office.
looks pretty good.
M4 Ultra MBP the secret processor that gives you superpowers ๐
Sowwy M4 Max *
I also tried on masterproef and I can't really tell any difference, memory usage just slightly goes down, but compile times stay around the same, just slightly slower
So the threshold seems document dependent
I should try on one of my larger slide decks
i would like to profile the memory objects but the tool told me most were comemo entries and I didn't get more information.
I think lower memory usage by itself is useful even if it doesn't improve speed
I'm wondering if the difference would be bigger with very slow ram
If your pc ends up having to swap then it definitely does
I'm more concerned about the very memory constrained web app
But yeah
It's true that this thing [the MPB] has insane memory bandwidth
I think it's likely useful
I made the threshold parameterized with a ENV var
called TYPST_SCORE in my branch
I'll publish it tonight if I remember to do so once I am home
I will also try it on some of my work slidedecks but obviously I cannot share those for reproducibility ๐ฆ Can't even send them to my Desktop to have a second point ๐ฆ
@left night what do you think of those results? For my thesis I can reduce memory usage a bit, but most of the memory is table layout so it's less impactful there ๐ฆ
looks promising at first glance
I agree, but I need to test more docs & actually run it automated because doing it by hand is a huge pain ๐ญ
To add one more data point: for a large slide deck of mine (using Touying), I get a reduction of ~100 MiB of RAM for 62 ms increase in cold compile
This weekend (iff I have the time) I might try and measure incremental times & memory for each and do a nice set of plots, but no promises
But on my slide deck at least, there's no decrease in hot compile time but a decrease (after one slide removal then re-adding it) of 120 MiB of memory usage, so definitely seems interesting
I wonder if we could reach better results with the smol indicator
Actually not caching contextual blocks also seems to help
but I'll need to measure it in more details
How are you determining whether something should be cached or not currently?
I give it a score (done by traversing the AST when defining the closure), the score is:
-
- 1 per AST-node
-
- 10 per function call (I want to refine this to specifically array functions (
map,filter, etc.) and +1 for others but not done yet)
- 10 per function call (I want to refine this to specifically array functions (
-
- 1 per AST node in for loop iterable + 10 * body's cost for for-loop
-
- 10 * (1 per AST node in condition + body's cost) for while-loop
-
- 100 for module imports (not sure if needed tbh)
If the score is below the threshold then the function doesn't get memoized
(note that the scores could be changed, they're just what "felt" right lol)
I think that constants should likely have zero cost
I think one would have to actually measure to determine what the appropriate cost should be
Yes, but good luck with that
There are tons of different AST nodes, assigning a cost to each of them that works is tricky
Linear regression?
I mean what I could do, but it would be horrendous, is make typst load a parameter file then use some gradient descent algorithms to find optimal costs for each but I would need loads of documents and it would take forever to run (since it would need to estimate the gradients since there would be no autograd) and truth is, while I use torch at work all the time I have never done anything "sampled" like with Stochastic Gradient Descent (https://en.wikipedia.org/wiki/Stochastic_gradient_descent) something I have head of but never used and I am unsure if it is suitable ๐
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data s...
And it would require many documents for testing
rather than the three I am using right now
that being said, with many docs and memory & time measurements and a suitable loss function, we should be able to reach a very nice objective potentially ๐
If you can create the dataset, than others could do an analysis.. Also, God did grace humanity with other algorithms than SGD... It is more than nested nature of things that would cause an issue I think.
I don't know much about this stuff :-p
I generally work with fully differentiable circuits (outside of quantizers which make my life difficult) and I can just use torch as-is ๐คทโโ๏ธ
If you model the cost as linear in all ast nodes, you don't actually have to do any gradient descent. You'd just have to do a bunch of measurements (more than the number of parameters) of different combinations of weights and the resulting objective (execution time?)
Ok, but you have like all of the different ones? or do you do it one-by-one?
I'm not sure I understand what you're asking about
I have a conway's game of life simulator, but its probably bottlenecked in either layout or being slow from copy on write:
https://typst.app/project/riswjh0KOErxkrhx1UTHQj
(You can scale N to make it as big as you want)
that's cursed in the best way ๐
Lots of small functions too!
The ray tracer is also an option
Ok, I automated benchmarking and it's running benchmarks on three docs atm, will check back in once they're done (likely > 1 hour)
It includes benchmarking memory usage (so it runs watch and measure memory then kills it)
masterproef notes slides
time mem time mem time mem
score
0 3.265 s 1483.3 MiB 5.026 s 2001.2 MiB 1.478 s 811.0 MiB
5 3.673 s 1457.6 MiB 5.039 s 1862.0 MiB 1.482 s 794.7 MiB
10 3.638 s 1427.0 MiB 4.990 s 1523.5 MiB 1.478 s 720.2 MiB
20 3.673 s 1412.7 MiB 5.454 s 899.6 MiB 1.510 s 722.3 MiB
30 3.821 s 1430.9 MiB 5.693 s 852.2 MiB 1.512 s 714.9 MiB
40 3.827 s 1412.7 MiB 6.034 s 783.0 MiB 1.515 s 710.0 MiB
50 3.832 s 1386.3 MiB 6.096 s 695.1 MiB 1.524 s 699.1 MiB
100 3.822 s 1401.9 MiB 6.018 s 623.9 MiB 1.525 s 690.9 MiB
150 3.820 s 1381.1 MiB 6.493 s 565.2 MiB 1.529 s 687.0 MiB
200 3.822 s 1370.7 MiB 6.548 s 482.6 MiB 1.528 s 684.0 MiB
250 3.815 s 1393.6 MiB 6.925 s 534.3 MiB 1.530 s 688.8 MiB
500 3.820 s 1377.2 MiB 9.650 s 404.5 MiB 1.538 s 703.0 MiB
1000 3.818 s 1368.6 MiB 11.108 s 373.9 MiB 1.540 s 692.9 MiB
1000000 4.075 s 1352.0 MiB 16.124 s 360.8 MiB 1.663 s 663.2 MiB
So, notes on this:
- ran with
-j 1 - masterproef = my thesis, notes = the
notes.typposted here earlier, slides = a slide deck from my job - ran on a Mac Book Pro M4 Max 128 GB RAM
- a score of 0 means all functions are memoized
- a score of 1000000 means that (essentially) none are memoized
So basically not worth it?
I'm getting the opposite translation. Everything is memoized right now. If nothing is memoized, then some things are improving by a large margin in memory, but not in time. There is a clear trade-off occurring.
I'm basically looking at it as the score being the Treatment, and these three are different patients.. God a survival curve would be nice here.
In all instances the best performance is achieved with basically everything memoized
Though for one of the documents in particular there is a big memory tradeoff
Slides gained 148 MiB for loss in time by 0.185 s. I think that's also something.
To me itโs also the opposite, small scores provide a nice decrease in memory usage at almost now performance cost
Therefore itโs a valid thing. The problem is that most of the savings is in eval which isnโt the majority of work. Layout is
I think that if time is what we want to save then eval is the wrong place to look
Again, I might be terribly shortsighted, but the memory usage is too much for my nerves. The notes thing goes from 2 gigs to 360 MiB.. I think this needs to be dealt with somehow.
I agree, small function memoization is a big issue, actually I could simplify codly a lot if we had small function memoization
but I literally inlined EVERY SINGLE SMALL FUNCTION in the main file to reduce memory usage and execution time
I LITERALLY MADE PRE-PROCESSOR MACROS TO DO THIS
And honestly looking at these plots, a score of 20 sounds delightful
I think these plots would be easier to judge if the time and memory would start at zero at the bottom
Aye can do that
The way it is right now, the slope is not really meaningful
@left night
Obviously my thesis is a special case because it's all table layout ๐
For the notes.typ a score of 20 gets you half the RAM for basically the same performance and on my slides a nice 100 MiB reduction for the same performance
Is it possible to get more granular measurements around 20?
Sure, give me like 6 hours for it to run ๐ญ
These things are sloooooow
I tried a bunch of stuff with the data you provided yesterday, and it all came out non-sense, or atleast non-sense to me. I think we don't have enough around the infliction point (also the 2 responses is new to me ๐ฆ )
Should take around ~1 hour
I also think I should try various scores for different AST nodes
but that becomes painful
Thanks, that's much easier to read for me
It does indeed look like we can get a nice amount of memory saving
I think for improving performance rather than memory usage, the key is to reduce function call overhead more generally, and not memoization specifically
I think that if we do this, I can make codly have functions again and we will see a nice reduction in masterproef too since codly will be memoizable again (which I explicitely made it not be)
Well memoization causes large memory usage, but in some cases, it's not strictly necessary, I still think having a smol keyword could lead to more significant benefits, but I can see the argument against manual control here
I'll actually add raytracer & conway's game of life too
since they're more eval heavy
What's could also be a worthwhile optimization is a move optimization
Where a value is moved instead of cloned on its last use
That way, things like arr = push-one(arr) could reuse the array instead of doing clone-on-write all the time.
But that would require two-phase eval
Similar things with not re-evaluating constants all the time
I think this goes into a similar direction as things you explored in your bytecode VM
I KNOW
On my branch I have also been hunting usages of .clone() and .cloned() throughout the code base
I actually manage to save some compile time there too
The VM could do it while compiling 

(is it time for VM3?)
it'll work this time
I mean it always worked
but for real docs, the compile time just removes the benefits
outside of memory usage iirc
but with move semantics, more borrowing, & small function optimization? could be interesting ๐
I also still think that if we cached package compilation it would be ๐ฅ
Like when first downloading a package, it gets cache compiled locally
wouldn't that be neat
it would indeed, but then you'd need a stable ABI of sorts
*especially when type checking is a thing
I don't think you would?
That would probably be very useful for custom type performance too ๐คทโโ๏ธ
rust doesn't have one and still does that
The INSR wouldn't even need to be stable, it could have a header containing the typst version
and just re-cache if the typst version has changed ๐คทโโ๏ธ
right sure that works
I still feel like it would be possible to make these optimizations as incremental improvements which could turn more and more VM-style instead of outright replacing eval immediately
Not stable, but serializable
true, but at some point you'd still have a CHONKY PR that actually implements bytecode VM
Which already is a bit of pain
But only spans are a pain, right?
Also the constant pool
Which contains arbitrary values sort of
Especially if you constant fold a ton (because we can do that!)
Anything that is well-known during compile, could as compiled and run right away
I think it would be sort of a hybrid
Not really, at least not with how the "old" VM worked, the constants were strings, numbers, etc. not dict, arrays, etc.
But they could be
Yes
An interleaved/hybrid compiler/interpreter that always runs things as soon as it knows enough would be interesting I think
This way, things run the least amount of times necessary
And it's only possible because of all the guarantees Typst can make
Compared to e.g. JS
loop?
compiler *
Essentially we'd have two evals: eval, compiler + vm
with three components
or I guess we could vm-eval instructions as we go
yeah, I think that's closer to what I meant
it's a bit like Zig works I guess, edit: though, maybe no
We could even keep the state of the VM as we go actually
and then resume with a prepared VM
as for a VM, I'm not sure how the instruction set of yours looked like
but I wonder whether higher-level instruction would be more the way to go than emitting lots of low-level ones
Definitely
I think the low-level ones were too complex
I think the performance problems of the current eval is not even really the interpretation
It's
- the AST traversal (especially since it's a CST)
- the lack of optimizations
- the function call overhead
but it could even be okay to compile a while loop to a while loop instruction rather than some jump
I think the original VM (due to being low level) was too slow for making args too
Agreed
The old vm was too low-level and as a consequence too complex
especially since set & show can't be well expressed at a low level anyway
field access is also very messy in Typst ๐ฆ
Because it can be mutable or not
and it's hard to know
because of mutable methods
If we had type information (which we can't realistically)
we could at least do it more easily
but in practice we don't and we can't have multiple mutable references because of rust
but in practice we know with 95% certainty statically whether it's mutable or not, just from the name
perhaps that is enough to make it fast
the 5% can be a slow path
true
I need to remind myself of how I had done it
It's... overly complicated
One thing I could do in the VM is have one "mutable reference register" that uses some unsafe code to work, but has checks to make it safe
allowing to make accesses just chains of mutable in that register
what is the concrete problem you're trying to tackle right now?
mutable access
I mean not right now, I am just brainstorming
mutable access requires you to have a mutable reference to a value of the VM (self-referential mutable ref)
and then be able to "chain" those self-referential mutable ref (for nested access)
I meant right now as-in brainstorming
So you want &mut Value and &mut Vm passed to the same function and the value resides in the Vm?
yes which is nice
but I remember it being a bit tricky to deal with that
but the VM would break it down in several INSR which would kind of be needed
what I ended up doing was a single INSR for accesses but it was so slooooow
and so overly complicated
I think honestly improving eval performance is less about making documents go faster now, and more about enabling package authors to build more complex things without worrying about making documents go slower.
Things like cetz, lilaq, or codly
Like, you shouldn't have to inline all the functions by hand just to make it go a bit faster
That's just frustrating. Especially for Rust developers that are so used to more or less zero cost abstractions.
True
But it's tricky to figure out how to improve things
clearly disabling memoization helps but it doesn't completely solve it
It's still running but I can already tell that the results for @atomic violet's raytracer are interesting ๐
@left night @feral imp There are less samples so the data is more noisy making it less useful ๐ฆ
I'll run it with more points per-score tonight
@sturdy sequoia For the convey program, you could annotate the exact functions filtered by the scoring algorithm at the critical points on figures. This sounds meaningful to me.
I'll try that ๐
Benchmarking the raytracer is... interesting ๐ญ
Also, it's an interesting doc because you can really see that all of the small function don't get cached which saves time & memory. I am not so surprised because when I profiled the raytraycer, most of the time was spent hashing arrays
At least this evening we should have the new plots, I'll also make some zooms in the 10-50 range for scores
@feral imp @left night @sly pecan Noise free data
Zoom from score = 0 to score = 50
To me it still shows that a score in the ~20 neighborhood is best in most cases, even masterproef
It's free real estate
Better scaling on zoomed plots ๐
masterproef ฮt = +13.94%
masterproef ฮmem = -3.97% (-58.06 MiB)
---------------------------------------------
notes.typ ฮt = +8.80%
notes.typ ฮmem = -55.02% (-1099.09 MiB)
---------------------------------------------
slides ฮt = +2.32%
slides ฮmem = -11.73% (-94.02 MiB)
---------------------------------------------
raytracer ฮt = +7.30%
raytracer ฮmem = -30.78% (-850.86 MiB)
---------------------------------------------
conway ฮt = -38.07%
conway ฮmem = -62.22% (-273.55 MiB)
With a proposed score of 20, this leads to those change in execution time (+ means it increased, - means it decreased)
I need to make a version of codly with the small functions again and test masterproef with that version of codly
I am pretty sure it would show a massive difference
I also find it fascinating that the conway impl has so many small function the compile time actually decreases
@untold turret my slides use touying but I am surprised by the increase compile time rather than decrease :/
So any comments @left night? ๐
I've also added a feature (which likely won't be in the PR -- if we get there -- or maybe?) that produces tha score for each closure when --timings are enabled
And looking at it, I have a request (which might not get implemented) for Laurenz: we need a matrix API (at least 2D matrices, not generic tensors) because most methods in cetz in matrix.typ have huge scores but I think they would be orders of magnitude faster if they were native
and I think that there is an argument to be made that for Typst, a typographical tool, to have a matrix & basic transform API built-in for drawing libraries
For example:
{
"span": "@preview/cetz:0.3.4/src/matrix.typ:231",
"name": "mul-mat",
"cost": 25288
},
{
"span": "@preview/cetz:0.3.4/src/matrix.typ:281",
"name": "mul-vec",
"cost": 1437
},
{
"span": "@preview/cetz:0.3.4/src/matrix.typ:298",
"name": "inverse",
"cost": 264652
},
remember that each AST node is worth 1 "cost", and a function call is worth 1 unless it's an array function in which case it's worth 10, loops are worth 10 too
So a score of 264k means this is a chonker
(nested loop mean that for each level it's squared, so a loop in a loop has a cost of 100, a loop in a loop in a loop a score of 1000)
Or if we want to have this being possible to make with plugins, I think an opaque API that allows plugin to have managed memory (i.e a Value that points to a value inside of the plugin) that way we can avoid the back-and-forth copying of values
That would actually be kind of awesome
With a simple API for freeing them (an opaque could have a reference counter) and then a simple free(type_id, opaque_addr) where type Opaque = (RefCounter, type_id, opaque_addr) where the type_id is also opaque and defined by the plugin, and the opaque_addr is just a 32-bit address in the plugin's memory
Technically this breaks purity, but if we assume the plugin is pure, then it should work
(BTW this could be implemented already, but would lack deallocation since there is no way of freeing a value)
That would make a numtyp to be possible to build in Typst using a plugin for better performance
It looks promising! Picking exactly 20 might be overfitting to raytracer and conway a bit, but it's definitely a good start.
I think the purity constraints might cause too many problems here, also with plugin multithreading and such.
I hadn't considered the multithreadedness, but I think that making it required that they're immutable would be okay, no? Worst case scenarios, plugins will break...
(the plugins doing things wrong *)
Well that would be a bad worst case :p
okay haha
Yeah I saw "laurmaedje typing..." and realized I was missing a key point
lol
I'm not sure right now tbh. Maybe it could work.
The thing is that free is sort of inherently a mutable operation
I like the ability of building numtyp with actual performance
yes ๐ฆ
If we couldn't free things (and we assume immutability) it would kind of work
but we also need to be able to free memory lol
It's a bit of a question of whether it's more worthwhile to try and reduce plugin overhead to the point where many small calls are cheaper or whether it's more worthwhile to optimize performance of numeric Typst code.
With enough dev time, I think Typst could do a ton of optimistic optimizations on numeric code
I think that having a vector + matrix API would already do most of the heavy lifting
Probably yes
To be fair, that's also an argument for compiling to bytecode, we can do far more optimizations than in an AST-based interpreter AFAIK
Yes, though I think it doesn't have to be bytecode per se
We can be creative with whatever kind of data structure we think serves our purposes best
There might be some different trade-offs in Typst than in your average programming language
(Or maybe I just like to reinvent the wheel ๐ )
Tbf the concept of typesetting languages and their optimizations is relatively unexplored so far I'd say lol
Since it doesn't quite fit in the typical compiler model
Yes, definitely. And even among normal PLs, an imperative, pure language with value semantics is one of the rarer things.
yes bytecode is a bad term
I agree
just some "lower-level and linear" representation
linear = not a tree, just a big array of ops
I wonder if by making the structure higher-level (and therefore removing the issue of slow compilation) we can see a benefit
because eval was ~4x faster but compilation often made it impractical
but if we could have partial compilation (i.e just like partial re-parsing -- forgot the proper name) & higher-level struct that lead to smaller compilation time we might be able to benefit more from it
incremental parsing? therefore incremental compilation?
could be a cool fun area to explore
agreed
Also, by not doing a register-based VM but a 'slot' (i.e local vars) based system, identifying the last use of a 'slot' (or even an unused slot) should be trivial by doing a first pass over all of the ops, and replacing the last one with a "move" version of the operation
Therefore doing value moving
We can even make it easier by in the compiler having a list of 'slots' with the 'last use' stored in it, then we just replace all of the 'last use' and if it's none then because no side effects -> we yeet the entire slot
then once compiled we don't carry that information
Also by not being register based it should be much easier to debug since each var points to one 'slot'
and temporary values can use registers still or just 'short lived' slots that are move-only
Because in the original VM I allowed reference to registers because variables could be in registers
but if it's only ever intermediary values, then they're always moved never borrowed
Cool optimization of separating them!
So we can probably save quite a few clones!
Ugh, I am working on this after work ๐
VM 3? ๐
VM1
VM2: Electric Boogaloo
VM3: Photonic Boogaloo?
I guess we could continue making some hacking on compiler to figure out the bottleneck of packages. we might gain a lot without needing to compile all functions but a small number of ones from existing packages. cetz was greatly optimized while touying didn't, reflecting thar cetz should have different workload from touying. I also remember the performance of polylux (not the touying) was greatly affected by some internal parameter, which could be a good start to learn which case is creating huge state (and bloating the size of the cache). https://github.com/Myriad-Dreamin/tinymist/issues/513#issuecomment-2322834251
I've felt like we're much closer to Haskell than we have any right to be, so maybe worth studying some GHC internals
In particular, lazy evaluation might be worth investigating
That might affect semantics though, doesn't it?
Depending on what becomes lazy
Not in a pure language with our semantics for state objects. Our content is exactly like Haskell IO
I think it will have semantic differences for performance, but not for output
I meant semantic difference in terms of what errors are raised
If the code doesn't run, it can't error
Ah, maybe
That is the one non-pure effect that we have
It's not necessarily a side effect. You can model it solely with return values after all. But it still affects lazyness.
Though I guess you can model everything in a pure way if you really want to
the compiler could change semantics on raising error if necessary I think. you can ensure nothing but only raise ones during computing.
A problem is probably that being closer to fp doesn't quite mean get better performance. there has been some laziness in typst. i'm interested if there is some hopeful abstract ion (to motivate investigation) could help typst be more lazy and get benefit.
Ok, so I just started work on VM3D, and I managed to find a new way of writing the compiler that no longer requires recursive iteration to find nesting, joining, etc. which already significantly simplifies the compiler
@left night Wanted to quickly run this design by you since it involves one unsafe line:
- Essentially for mutable access, I am solving it by "locking" the VM (i.e preventing reads & writes to slots = local vars)
- Then I recursively descend the access (does not involve running any instructions but
Callinstructions which don't need to read slots, only the stack) - This allows me to alias a pointer on the slots (local vars) safely
/// Locks the VM to get mutable access to one of its slots.
///
/// During the lock, no other access whether mutable or immutable can be done
/// to any other slots.
pub fn lock<F>(
&mut self,
slot: usize,
method: impl FnOnce(&mut Self, &mut Value)
) where F: FnOnce(&mut Self, &mut Value) {
assert!(!self.lock.get(), "trying to lock a locked VM");
self.lock.set(true);
// Safety:
// - we have asserted that only one mutable reference to the VM exists (&mut self)
// - we have asserted that the VM was not locked before (!self.lock.get())
// - we know the reference cannot escape the closure
// - we know there will be no other aliases (locked)
// - we have exclusive lock on the VM `self.lock.set(true)`
let slot = unsafe {
let ptr = &mut self.slots[slot] as *mut Value;
&mut *ptr
};
method(self, slot);
self.lock.set(false);
}
(this is a method implemented on the new Vm)
You should note that data can escape the closure if you add some lifetime bounds later, this should be avoided by whomever refactors this in the future.
specifically if you add a lifetime 'f : 'v with 'F: 'f and &'v mut Value you can extract the reference.
essentially you've implemented a mutex now, right?
can't you just put the slots into a mutex or refcell?
and under normal circumstances you call RefCell::get_mut to skip the overhead
yes, but it needs to be manually implemented because I don't want the cost of using it elsewhere
I'll add manual lifetimes bounded to the &mut self lifetime
Ah, I see what you mean @left night !
but no, right? because then my aliasing mutable is on the whole Vm no?
I think as is is fine in terms of bounds, but adding more later for whatever reason may invalidate the safety requirements of the unsafe block
subtly
if vm.lock takes &mut self how can it alias anyway? through a nested call within method?
I'm not sure whether the code is safe to be honest. since it created two aliasing mutable references. even if they aren't used, isn't that UB by itself?
where does it create two mutable aliases?
Actually found a nicer, safer way:
/// Locks the VM to get mutable access to one of its slots.
///
/// During the lock, no other access whether mutable or immutable can be done
/// to any other slots.
pub fn lock<F>(&mut self, slot: usize, method: impl FnOnce(&mut Self, &mut Value))
where
F: FnOnce(&mut Self, &'_ mut Value),
{
let Some(mut slots) = self.slots.take() else {
panic!("trying to lock a locked VM");
};
method(self, &mut slots[slot]);
self.slots = Some(slots);
}
It's also zero-alloc
okay, maybe it doesn't actually create two to the exact same data
only one that is a descendant of another
and no aliasing can ever occur here
yeah, if slots are cheap to move, that seems like a quite reasonable solution!
it's just a Vec<Value>
It prevents me from making it a SmallVec<[Value; x]> but in my testing on the old VM it didn't provide any major advantages ๐คทโโ๏ธ
(for registers back then)
was slots in an option before?
no
can't you just replace it by an empty vec temporarily?
yes but then I can't check if it's locked or not
(it will then panic due to oob accesses)
depends on whether you check it
very much depends on how the rest of the VM works I guess
Actually the size_of::<Option<Vec<Value>>>() and size_of::<Vec<Value>>() is the same
so there's already optimization of repr
the size is the same, but if you call unwrap in other places where you are sure it's there, that'll be extra code
but it's true that currently there's the is_some check and the bounds check (panicking) with your technique you'd just have the bounds check
true
I'll use the bounds check technique
but not bool
it can just panic since it should never happen either way
(that's literally why there's a compiler)
/// Locks the VM to get mutable access to one of its slots.
///
/// During the lock, no other access whether mutable or immutable can be done
/// to any other slots.
pub fn lock<F>(&mut self, slot: usize, method: impl FnOnce(&mut Self, &mut Value))
where
F: FnOnce(&mut Self, &'_ mut Value),
{
let mut slots = std::mem::take(&mut self.slots);
method(self, &mut slots[slot]);
self.slots = slots;
}
I'll add a boolean but only in debug mode
(to have a debug_assert nicely in there)
does '_ always bind to the lowest common declaration? i.e. to the closure bound in this case?
I think so, but I'll remove it, it's not needed anymore
because if you introduce an explicit bound on lock and use it for the closure it can escape, that's preculiar
you may also notice that I have the wrong generic param definition lol
/// Locks the VM to get mutable access to one of its slots.
///
/// During the lock, no other access whether mutable or immutable can be done
/// to any other slots.
pub fn lock(&mut self, slot: usize, method: impl FnOnce(&mut Self, &mut Value)) {
#[cfg(debug_assertions)]
assert!(!self.lock, "attempting to lock a locked VM");
let mut slots = std::mem::take(&mut self.slots);
method(self, &mut slots[slot]);
self.slots = slots;
}
Complete with the debug check
i have chosen to ignore that
(the #[cfg(debug_assertions)] is needed due to self.lock only being declared in debug mode)
needs to set the lock as well :p

well on the new version it correctly catches the alias problem, i love rust
/// Locks the VM to get mutable access to one of its slots.
///
/// During the lock, no other access whether mutable or immutable can be done
/// to any other slots.
pub fn lock(&mut self, slot: usize, method: impl FnOnce(&mut Self, &mut Value)) {
#[cfg(debug_assertions)]
{
assert!(!self.lock, "attempting to lock a locked VM");
self.lock = true;
}
let mut slots = std::mem::take(&mut self.slots);
method(self, &mut slots[slot]);
self.slots = slots;
#[cfg(debug_assertions)]
{
self.lock = false;
}
}
Rust is love, rust is life
The move makes a valid escaping reference impossible to express in the type system
The nice way with the new access method I built (using this weird shennanigan) is that it should be heaps faster than the old one and is much easier and shorter code wise
pub struct MutAccess {
/// The head of a potentially mutable access.
///
/// If mutability is an option, the head is *always* a slot.
/// Otherwise this can be resolved with `RefAccess`.
pub head: usize,
pub segments: SmallVec<[AccessSegment; 2]>,
pub span: Span,
}
pub enum AccessSegment {
Field { name: EcoString, span: Span },
Call { method: EcoString, span: Span },
}
impl MutAccess {
fn eval<O>(
&self,
vm: &mut Vm,
ops: impl FnOnce(&mut Value) -> SourceResult<O>,
) -> SourceResult<O> {
vm.lock(self.head, |vm, value| {
recursive_access(vm, value, self.segments.iter(), ops)
})
}
}
fn recursive_access<'a, O>(
vm: &mut Vm,
value: &mut Value,
mut segments: impl Iterator<Item = &'a AccessSegment>,
ops: impl FnOnce(&mut Value) -> SourceResult<O>,
) -> SourceResult<O> {
let Some(next) = segments.next() else {
return ops(value);
};
match next {
AccessSegment::Field { name, span } => {
let dict = access_dict(value).at(*span)?;
recursive_access(vm, dict.at_mut(name).at(*span)?, segments, ops)
}
AccessSegment::Call { method, span } => {
let Value::Args(args) = vm.pop().at(*span)? else {
bail!(
*span,
"expected arguments on stack";
hint: "this is a compiler bug"
);
};
let next = call_method_access(value, method, args, *span)?;
recursive_access(vm, next, segments, ops)
}
}
}
fn call_method_access<'a>(
value: &'a mut Value,
method: &str,
mut args: Args,
span: Span,
) -> SourceResult<&'a mut Value> {
... // Same as before
}
fn access_dict(parent: &mut Value) -> HintedStrResult<&mut Dict> {
... // Same as before
}
/// The missing method error message.
#[cold]
fn missing_method(ty: Type, method: &str) -> String {
format!("type {ty} has no method `{method}`")
}
This is the entire logic for making accesses happen with this technique ๐
Most of which is actually the old logic too
which makes me very happy
Will make access segments use PicoStr eventually tho
It's nice because temp value accesses can be detected at compile time making the logic a bit simpler
the only case I haven't dealt with is the "basic" assignment
but I got to go to work early tomorrow ๐ฆ
Things are moving along nicely, the new compiler+vm (VM3D) will most likely be much smaller (less than half) of the old compiler+vm (VM2: Electric Boogaloo). It is less of a compiler+vm and more of a "flattening" of the tree into a nice list of operations (although it already contains tons of optimizations)
(that being said I have not done loops yet lol)
I also think it's much easier to understand
is there a public branch already?
@dherse loops don't matter a lot for some documents. I wonder some micro metrics of compiling and running some functions of cetz or some other packages.
I did run a benchmark of fibnacci(20) of same impl but with typst or rust. rust function took 40us and typst took 800us. I think we could also get a number between them on vm3 for it quickly.
that's less overhead than I was expecting
you know fibnacci can be memorized easily. I was trying to make a big news to say typst is faster than rust on some case, but typst lost :(. vm3 looks promising.
not yet
I agree, but loops are actually rather simple at the end of the day ๐
The thing is that I am going expr by expr implement the Compile trait and necessary Instruction trait since I am also planning on experimenting with different repr for instructions (a big enum, a big Vec<Box<dyn Instruction>>, a Vec<&syn Instruction> (which should be equivalent) and a twisted Vec<Fn(&mut Vm) -> SourceResult<()>>)
VM3D is very early days, it's just so much simpler, I am hoping that its "linear" structure will give us performance
You know what's great about being on VM3D? I can just reuse pieces of VM 2: Electric Boogaloo
I have recently been confronted with huge compilation time for par.lines (1s grows to 21s) in part due to layout divergence in some edge cases, which requires re-doing everything layout-wise, or something close to this effect.
Debugging further by looking at the timings, it seems like most of the latency when the layout has to be re-done comes from blocks (equations, etc.) where for some reason the par.line loop (realize-collect, or what I assume to be par lines) has significantly higher runtime than usual.
In a normal compilation, the elapsed time between realize-collect iterations is more or less <1ms but on longer compilation, it can take up to 5ms for the next it to start, which increases the overall compilation time by a huge amount. Anyway, I disabled par line numbering for unneeded blocks such as images/tables/equations, and that improves the runtime. I attach a "normal" timing and a "longer" timing zoomed in. I can also share the timings if needed!
In any case, is there any issue that tracks the above-described issue, or something related?
Interestingly, I have never actually looked at the "cost" of line numbering, if you disable them, does the timing go down? (i.e is the increased caused by line numbering) Or is it something like a show rule making a mess?
without any line numbering, the compilation is blazing fast. I don't think it's a show rule in particular that's having this effect?
Ok, I never looked into how line numbers work tbh
Could @glad urchin chime in (I think you did the line numbers, right?)
to add to that, without any line numbering you get a "normal" timing, where every step is executed without any delay
For some reason line numbering resolves into a very sparse execution in between each block
no it's likely that some function is not annotated with the macro so it looks like a gap but typst is busy, just not shown
T_T is it time to compile typst with new macros ๐ ?
probably yeah :/
A nice thing I changed from VM2 to VM3D regarding function call is dedicated instructions for method calls, potentially mutating method calls, and function calls, this should generally lead to a nice speedup I expect, it also greatly simplifies the complexity of the old function calls at the cost of midly more code overall (since each type of call can lead to the "common" type of a simple function call)
Well, it could be looked into, but in principle it does use introspection
And it attaches to every single line of everything
So if you have lots of lines that's lots of stuff it's doing
Still, 21s is a lot, but there's too little info here
If you could find a MRE, that'd be lovely
(cc @pearl sedge )
Ok: function calls, method calls, destructuring, and a bunch of other stuff are in, still missing a lot but getting there!
But the "big" things are done and it's about 10% bigger than typst-eval for now, it's still missing show and set rules as well as a few other things, but I am confident it should be no more than 50% bigger than typst-eval (compared to VM2 which was ~4x bigger iirc)
So still an increase but most of the space is actually taken up by the fact that every single instructions need a struct definition which makes it very big, but the actual code is about the same size
This is more of a "flattening" than a straight up Bytecode VMs like the other ones
@left night How comfortable are you with this increase in overall code size (note that I am not comparing to typst as a whole, just typst-eval crate)
||I am greatly enjoying being on holiday from my day job just to work on typst lol||
Still missing is: joining, control flow, content blocks, set/show rules, some types of patterns, some types of accesses, all of math, closures, let binding, anything module related, most control flow
But most of these are deceptively simple since they reuse code that's already there (all control flow rely on the VM2 code which I have already ported over), math is simple just repetitive, let bindings rely on patterns which are there, etc.
So it's coming along nicely
10% more is not a lot!
50% is of course already a bit, but typst-eval is only 2872 lines right now, so that's manageable overall
Nice!
At least it's not completely useless ๐
It has potential
(doesn't mean it'll be merged but good to know it's not just wasted effort)
I might make a macro for instructions to make their decl a bit smaller because like... a lot of the code is just the impl for the new functions lol
I think it is a bit inevitable that Typst's evaluation system will be replaced by something faster at some point. It might be VM3 or VM3 might inform future efforts. I think it's not wasted either way. :)
I agree but I think the main problem with VM1 & VM2 were that they were true bytecode VMs, this one won't be, it's just flattening with some cleverness baked in some places
I am also massively simplifying a bunch of things like argument handling, etc.
Also, it opens the possibility for interning some strings still, namely argument names can be moved to PicoStr
That's the hardest part, empirically that 21s edge case for a 30page document 50eq/50text that only occurs on layout divergence
I can try to whip up something close to my document later today but I'm unsure whether that'd reproduce the issue. We'll see!
Because doing that doens't make sense on main because you need to intern right before comparing which doesn't gain much performance
lorem function is your best friend
there's also this tool to mangle your document such that it's not readable
like it changes the words by words of equivalent length but different meaning
I tried that one but equations are not supported yet! Time to whip out my best math BS generator
First piece of content rendered using the VM3D ๐
What is missing to be able to compile more complicated documents?
Show rules, set rules, most control flow isn't there, lots of small things like that
Function calls work, methods etc. too
Just a bunch of small things
Thankfully with the VM2 I was able to grab the (very) optimized Joiner for handling joining & styles, lots of the control flow logic, etc. I also gave some new capabilities for handling control flow which should be nice
also as a side effects, we should get show and set rules as values "for free" although they are currently explicitly forbidden
but because the Joiner can join Styles with Content you can technically just do it
And looks like I was right on my ~50% increase estimate, I am not fully done but I have all of the big things now and it's about 50% bigger, but I will do a cleanup pass later too
Oh and math is just... not implemented
because I just don't care atm, it's always (in every VM so far) the last thing I implement
because it's not needed for testing performance & most docs (since I can just render math as an empty content in the meantime)
That makes sense
(in every VM so far)
seems like this is a recurring thing now ๐
maybe if we iterate enough times with a procedurally generated vm we'll eventually converge into the perfect vm
Lots of things are starting to work!
VM3D you mean 
not perfect.
We needed 2 content reworks, and now 3 VMs
what will take four iterations? 
But it's getting close to 1:30 AM, time for sleep 
it only stops when the document takes 1 cpu instruction to compile and has zero bugs
that's called a memcpy ๐
then we've solved it
(and that isn't even 1 op code ๐ฆ )
turns out we just need to predict the output of every document and embed it into the binary
not sure how i didnt think of that before
for people that don't understand this ref
I would also like to say that the VM is not, in fact, vibe coded
that's a relief
Took a bit of a break from work (since I didn't take a lunch break) and fixed joining ๐
It's still missing set and show rules, math, contextual, but otherwise it's all in there I think
BTW, for those of you that followed up, VM1 & VM2 were both register based VMs, but VM3 is stack based which overall is quite a bit simpler
With dedicated "slots" for arguments & variables
Damn good at hello world though!
