#Performance
1 messages ยท Page 2 of 1
Pretty similar to pitching a startup to investors :)
I really think it is. And it's totally awesome that it's almost trivial due to purity.
Try doing that with LaTeX!
And rust really does make that a breeze imo
just the Deferred abstraction being this easy kinda blows my mind
fearless concurrency for sure!
@left night I can already tell you: deferred compression makes a big difference
I'm not done testing but it looks very promising
@left night do you know what might cause comemo to randomly miss caches that should hit? ๐ค
in which code path?
in comemo::memoized with tracked values specifically
and this happens just with your changes you mean? or do you mean that you found a bug in comemo?
with my changes afaik
then, it's really hard to say
anything could have that effect
wrong constraints for instance
@left night pretty sure I know why, I'm using a hashmap instead of an IndexMap so ordering is random ๐คฆโโ๏ธ
Well no
goddamn
Ok, I determined that it's an issue with the accelerator ๐
No, it just happens less without the acceleraotr
GODDAMN BUG
HEISENBUGS ARE THE WORSE
waltuh
@left night since you know comemo better, maybe you can figure it out by looking over cache.rs: https://github.com/Dherse/comemo/blob/main/src/cache.rs
Ok, the plot thickens: if I run only one test, it never fails (I literally wrote a bash script that runs the test until it fails)
But if I run multiple tests then one of them sometimes fails
weird
very weird
Ok, it's eviction that seems to cause issues
OF COURSE
GOD DAMNIT
Tests are un in parallel
so it can evict while another test is running
๐
Since it's not thread local anymore
Bingo, that was it
GODDAMN that was an annoying bug to track
makes sense
@left night could you test your parallel layout with this version of comemo please ๐ฅฐ
I wonder if it's any faster
To get back to just Deferred for the PDF stream compressions, the gains are... meh, cold they're pretty good: around 6%, but incremental they're basically non-existent because the deflated streams are already mostly memoized anyway ๐ฆ
Mind you larger docs will see larger gains
(can you send me that lorem test doc btw?)
@sturdy sequoia can you like almost guess what the performance benefit would be to native with Dash Map?
for some reason, it is actually a fair bit slower with that branch (on your thesis) :/
this is with the threaded branch of the comemo repository
this is with your pull request (+ one line addition to the macro to make it compile. it needs Send + Sync on the trait impl).
I have pushed the experiment to the parallel-test branch, so feel free to test yourself.
One difference I'm seeing is that the my threaded still uses DashMap because I've written that before I knew about the Safari issue.
That might explain the huge difference in performance ๐ฑ
I'm guessing the contention on the locks is very heavy
I'll test it locally and see with a DashMap 'cause I'm pretty sure it will be a fair bit faster
What line? ๐ค
Thanks ๐
pential issue? :D
whoopsy ๐
It's late
@left night just so you know, in my slowest incremental test it divides (with my comemo) incremental time by four
@left night you're also using parking_lot whereas I'm still using std locks
I still think parking_lot is better because they don't poison!
So, your version wins on cold compiles but gets absolutely demolished on incremental thanks to all of my other optimizations, on my end you're slightly faster cold by about 5% but incremental your version of comemo is usually ~25% slower than mine.
pardon the mouse handwriting ๐
๐
Mind you, compared to main, both of these destroy main, my version of comemo with your parallel code leads to 66% faster cold, and between 60 and 242% faster incremental
parking_lot on my version makes no difference outside of removing all of the ugly .unwrap() calls
And actually, it makes sense that your version is faster in cold: in incremental there is less contention over the locks, and while my version suffers from lock contention, your does not (or not as much).
did you try yours with DashMap?
I also think that by not having functions race, we might get better performance (i.e by having a built-in Deferred<T>)
not yet
I am doing the change now
but it's quite in-depth since I have RwLock everywhere
It's tricky to not race though
compile e.g. has no non-tracked argument, so it would allow one compilation at once
which is maybe not a problem for our use case but strange conceptually
ah, you might be right indeed, my idea was to push it early to the cache and have the cache store a Deferred<T> essentially as the output
Just making the ACCELERATOR use a DashMap bumps performance on-par with yours in cold
It does make incremental compilation slightly slower
so it's kind of a tradeoff between cold & hot compile times
mind you I'm pretty sure I could optimize it to handle contention a bit better maybe
It's insane seeing my thesis mostly compile it sub 4s ๐
The accelerator is perhaps also conceptually not ideal the way it is now
it used to take 68s with the old glossary ๐ฑ
I honestly don't understand it still, I get the rest but not the accelerator ๐ฌ
There is a lot of unnecessary contention because it's just a big HashMap that acts like like a bunch of small hashmaps.
is there a way we could "split" them?
by that I mean split the little hashmaps
When a value starts being tracked, it gets a new globally unique ID. And when something is validated on it, this validation is cached in the accelerator.
So conceptually, each Tracked owns a hash map in there
But I didn't want it to literally own because Tracked should remain Copy
Ok, why not move it into the Tracked then?
ah okay
I sent my message as your sent yours
could we point to an ID inside of a global list of accelerators?
it stays copy and on evict we clear that list of accelerators
it's a bit weird ngl but it kinda solves the issue of contention on that big hashmap
or we accept that Tracked is Clone ๐
since the IDs are ever-growing during execution, it would need to be a global list
Cloning it would split the acceleration, which is kind of pointless
the lists could be local to tracked types?
It would rather have to be handled through a reference
inside it could be an Arc<Mutex<HashMap<>>>
I don't like it
come on, one more generic bro, one more generic and it'll work bro, zero cost abstractions bro
or something like that ๐
Effectively, the accelerator is a leaky bump allocator with manual garbage collection and no reference tracking
currently, or in one of our ideas on how to re-implement it?
currently. we "allocate" a hashmap in the accelerator by using a specific ID, we only use increasing IDs and just collect during eviction. since it's only for perf, it doesn't matter if we have "use-after-free" (a tracked instance outlives evict) because it's just a validation miss.
bump allocator is maybe the wrong term, but for the IDs basically
Yes in my version, it would panic in a "use-after-evict" situation
(using the ID to reference and keep Copy)
BTW if that can reassure you, with your parallel version of Typst I can't really see the difference in the PDF, so good job ๐

with the totally broken branch without proper introspection you mean?
yeah ๐
lucky that your thesis apparently doesn't really need introspector disambiguation
that's surprising ๐ฎ
if you have lots of manually written stuff it is needed far less than when generating stuff
because the spans are part of the core hash that could be duplicate and thus needs disambiguation
I mean there are definitely a lot of quirks like some figures are randomly numbered, but overall it's hard to tell ๐
yeah, that's about what I would expect
you mean Vec<Map<.., ..>>?
yes
where we have this big vec pointing to individually locked maps
and you want to reset the ID in evict instead of ever-increasing it?
No idea, I haven't thought about it that far ๐
or there is the option of making Tracked Clone and using the Arc<Mutex<HashMap<>>> ๐
alternatively, it could also store a "head" offset
||just give it some head ๐||
trackeds from before evict would then not profit from acceleration
because if id < head -> None
yes
if id >= head -> list[id - head]
but, do we ever keep Tracked beyond an evict call?
I want comemo to have a fool-proof API
makes sense I suppose
I mean we could have a global Tracked counter
but it would require it to be Clone
(and panic on evict if any tracked values survive)
I don't follow
no, I mean a reference counter of all Tracked values
neither do I because it also means Tracked is Clone
yes
you could try whether the accelerator sharding gives any gains
sure
meanwhile, I need to figure something out for measurement
my gut says yes, but it's unclear
I want to get the parallel stuff merged soon. the measurement thing is the only remaining blocker.
To most members of the #contributors it looks like I'm doing complicated stuff, but I actually do the baby level stuff compared to what you're doing ๐
nah
I'm just dealing with the problems I created myself
it's crazy to me how invested (and obsessed) the contributor community overall is in the project. it makes me really happy.
I think you tapped into a deep desire to get rid of LaTeX, and Martin and you found some really GOOD solutions to some really REALLY hard problems
Especially regarding UX, and for that you should be 100000% proud of yourselves!
yeah, this is the most successful latex alternative so far
@left night you have no idea how painful it was to modify Typst to using local-accelerators ๐ญ
(without using .clone() everywhere)
So, it's not that much faster than my initial implementation, it catches up to yours in cold compilation, and it's slightly faster across the board by about another 5% (up to 10% in some of my incremental tests)
One change I did make is to allow &Tracked<T> to be used to remove most of the cloning
So the ergonomics isn't terrible
but it's not perfect either
It blows my mind how freaking fast this makes Typst
with the parallel export we'll be golden ๐ฎ
(since with parallel comemo we can also get parallel export across the board)
like we are approaching sub 3s territory here for compiling a 160 page thesis filled with figures, introspection/queries, and other stuff
it's mind blowing
O-M-G
@left night are you seeing this?
my thesis just compiled in sub 3s
HOW
JUST HOOOOOOOOOWWWWWWWWWW
@onyx furnace with the work @left night and I are doing, oi-wiki and without wasmtime (which is also coming afaik): 44s cold, with wasmtime we'd likely be looking at ~20s cold
BTW, this is done using mimalloc, which is a highly optimized allocator for parallel applications
And there are still a ton of micro-optimizations we can do like using hashbrown, using faster hasher for all of the hashmaps, etc etc etc.
yay!
doesn't std use hashbrown already?
does it? ๐ค
Maybe it does and I just misremember
all you youngsters that are used to the modern rust std lib
back in my days mutexes were slow, channels too, and so were hashtables ๐
hey, those were also my days!
I've been using Rust since 1.15 or sth like that
I started using rust and I remember when the 1.0 was released!
okay, I'm a youngster
(I am old, please help ๐ญ)
So, after the most performance productive day in Typst history (99% being due to your work) I am off to bed ๐ด
good night!
u too (sleep is important ๐ )
One last thing, compared to pre-content rework, that's a 5.6x to 6.2x improvement in incremental compilation time
needed to calculate that before going to bed to sleep well? :D
Yes! ๐
I have a big excel spreedsheet (which is a complete mess) so it's fairly easy
@tight glade look at what Laurenz has done ๐ฑ
oh wow that's crazy, that's so so impressive Oo
let me INVESTIGATE how the hell is this possible ๐ฎ
@sturdy sequoia you are so excellent that the bottleneck is going to move to browsers or pdf viewers and not longer at typst compilation. I never imagined a general comemo in parallel
parallel comemo was haaaaaaaard
especially since it's a rather highly optimized one too
did you make it happen?? ๐ฅณ
parallel comemo was me, parallel Typst is all Laurenz
mind you Laurenz suggested several comemo optimizations which are now in ๐
my god, you people are busy that's amazing โค๏ธ
that's too fast!๐
How many threads does it use?
@left night I pushed the changes for parallel comemo with local accelerators so that you can see what it ends up looking like
It would be nice, if the cli had an option on how many threads it is "limited" to use.
Like -j or test-threads.
yes, but that I think will come later
it's also needed for parallel image encoding
is it faster?
which currently isn't limited at all
yes
it's as fast as yours in cold compile, slightly faster incremental
did you also try the sharded global approach?
no not yet
you are using Mutex because it's always write, correct?
yes
but that's only because of the entry API
so no need for the overhead of a second AtomicUx in there
I wonder if RwLock with upgrade on miss would be faster
upgrade on miss?
when entry is none
basically read -> check hash map
if success -> return
if not in the map -> upgrade read to write
can we upgrade in a RwLock?
incurs double hashing but only on the slow path
I've also thought of just making the age atomic
then we don't even need mutability to lookup only when inserting
which should help too
I'm not sure what the performance implications of an upgradable read are
yes that could be a win, too
I'm first trying the method with an atomic, I used the orderings from Arc in std-lib for it hoping that it makes it faster than SeqCst
while the owned accelerator is in principle conceptually cleanest, it does make Tracked conceptually less nice, too. right now, it's is just a wrapper around a reference and TrackedMut around &mut. And both can be used like those w.r.t to copying, reborrowing, etc. having it be clone would ruin that.
@left night atomic based age provided another huge speedup of about 10% across the board!!!!
some incremental compiles are now sub 100ms (LSP territory)
wow
I also think that the vec-based global accelerator might actually be faster because it needs no atomic cloning ops. maybe it could even reuse the allocations for the accelerators.
yes I think you're right
I'll try that next ๐
@left night using the sharded accelerators is another 10% faster cold, but equivalent incremental (perhaps 1-2% faster)
my bad, ignore that I had made a mistake I just caught
I'll check again once it's done building
Well after fixing my mistake is equivalent in cold & hot compiles, but it keeps Tracked copy which imo is worth it
like re-using the hashmaps? then no
I'll see if it can be done
btw how much faster are the new changes compared to 0.10.0?
might have already been said, but it seems like you find new improvements every day and they add up :D
about 3x faster cold, 2-2.5x incremental
and that's without some of the other optimizations that I am cooking on the side
at this point, I think building the introspector is one of the main bottlenecks in incremental
this could possibly be fixed by integrating a hierarchical introspection API with layered caching directly into the frames
I've been thinking of that for a while
but we'd need to verify the assumption that the introspector is the slowest bit
we need the tracing ๐
the tracing no longer measures the introspection time :-p
(but it could be made to again)
Ok, so I have made typst significantly slower trying to re-use allocations
fun fact, it's faster without accelerators ๐
interesting
that was definitely not the case when I first introduced it
maybe that's because all tracked methods have become cheaper
world is more optimized and introspector, too
but I still find it hard to believe that it's really slower
it could also have to do with all of the goddamn locks everywhere on the accelerator ๐
with the sharded thing only, without it's faster
so:
- one global accelerator: slower cold, slightly slower incremental
- one accelerator in
Arcper-Tracked: fastest cold, fastest incremental - sharded accelerators: slightly slower cold, barely slower incremental (but doesn't work right yet)
I have a feeling that the upgradable rwlock is super duper slow
they all do
because in the sharded accelerators I still need mutable access to the one accelerator
so I .read() over the list of accelerators then .write() on the individual accelerator
and I .write() the list of accelerators when calling evict but that doesn't matter too much imo
ah
what you'd need is a lockfree bump allocator
but even if that exists, you could probably not clear it through an immutable reference
technically bumpalo is lockfree
so maybe that's the solution ๐ค
it's not Sync ๐ญ
Average rust generic signature
So will multi-threaded typst come out next update?
I mean considering you've done all you can to do less work the next step is being able to do more work per second
ok holy shit, but I guess it will be less on weaker CPUs, right?
Off to buy a 7995WX for typst
I don't think that many parts of typst could benefit from gpu acceleration
It was a half joke
Only image export
not really because while it uses many threads, the slow bits are still single threaded so as long as you have four cores or more
Which would be good for web app live preview
At least I can use all 16 of my cores now
yes, I have been thinking of writing a webgpu backend for Typst
Also typst-live
it's already the case in 0.10 for PDF export ๐
You see each core runs at 4ghz
yeah I got 4 cores on a 2019 thinkpad which I mainly use typst on^^
I have the 4ghz I WANNA USE THE 4GHZ
Sounds like quite the task
yes indeed
Probably yes
At this point you should just join the typst team as "that one performance guy"
I wish, I wouldn't need to interview for jobs anymore ๐
Working on OSS is one of my dream jobs
you've got three sponsors, can't you live off of that?!
Gotta pump those numbers up ๐
I have made 15,63โฌ so far ๐
@left night as far as I can tell, the best options are:
- local hashmaps in an
Arc - a single global hashmap
the sharding thing just... isn't working ๐ฆ
relatable ๐คฃ
Ok, I have recovered the lost performance ๐
@left night It's still slightly slower than a local hashmap, but it remains Copy
I think that happened to me recently but I couldn't figure out why and it fixed itself
yes, in this case it was tests running in parallel, but it's fixed now ๐
Ok, I managed to recover the last lil' bits of performance ๐
Not sure if this would work but would it be performant to write a comemo cache to disk and when typst starts again load that cache
For persistent caching
The cache is so big that reading and writing it to disk would likely be slower
Especially with all the recent optimizations
Interesting
I also think the issue is... Windows support? They were discussing some channel socket business, but windows was a drag.
Possible? yes absolutely, necessary? imo (and even more so with all of the tops coming) not so much. You have to realize that we cache lots of things which leads to ~3GB of use for my thesis (imo we probably cache too much), reading and writing to disk would be excruciatingly slow, and finally, we frankly don't need it: unless your document is really really big it will takes like ~20s to compile with these improvement (see oi-wiki with the improvements being massively down), so the further gains would be to maybe make that < 10s but with all of the complexities of disk caching
Additionally, there is the problem of deciding where to store them ๐ค
It's also not that simple technically
For instance, we're hashing a lot of pointers to statics, which can be affected by ASLR (Address Space Layout Randomization)
I didn't even consider that, but indeed, we'd need no hashing of addresses
and then there's also the fact that we have leaky interners and we hash the IDs used by them
it's a bunch of problems and I don't think it's worth it
definitely not imo
3gb ram usage sounds like death for lower power hardware
Hence why I said weโre probably memoizing too much
Yeah
Make it work, make it fast, make it work on hardware from the last decade, make it work on IE7, make it rain โ
I feel like the cutoff here is hardware from last decade
If typst can do that I think it's succeeded in being speed
If I can get it to run on my 15 yr old Lenovo laptop locally without performance issues I'll count that as a win
I don't mind memory usage if it has an actual purpose. Ideally it should scale based on the users amount of ram right?
I'm just making a fun remark. I think that memory usage is a complex thing to optimize. It will probably be a separate investigation.
๐ค
I think 8 GB machines is a reasonable target to make sure it works well on
I agree with this
I know a few people who still are running on 4GB
but yeah 8GB is reasonable
If we can get 4GB running well thats the best outcome imo
Sure. I'm sure it's possible to dynamically scale back comemo.
This is from the Steam hardware survey. Obviously biased towards higher ram, but it's a data point
I'd say that 8gb is probably the most practical minimum
It's anyway desirable for Typst to run well with 4GB max to itself (which is already a lot of course), since that's the WebAssembly limit
at least until memory64 becomes more prevalent ๐
but yeah I guess 4GB should be a sensible limit
Honestly, I would be surprised if Typst wasn't already running pretty well on old hardware
If you have 4 GB ram you're lucky if half that is available for applications
I feel that a best solution is to adopt a circuit breaker for memorization, because if we don't reach the memory limit, the more memory we utilize the faster compilation we get. When it detects the big document takes too much memory, it disables some memorization that has relative low efficiency to ensure experience.
It uses quite a lot of memory. Though I think a lot of the memory usage when using the web app comes from the preview canvas not being aggressively garbage collected
I didn't catch the "to itself"
||the memory limit after fixing the lil' mistake you guys at made ๐||
Maybe we could do this by allowing the "memoize" macro to take in a "level" of some kind that says how critical it is to memoize its result
Couldn't it be dynamically prioritized based on how often it's being accessed?
#naive
Sure, but I feel like finding a good heuristic for that might be heaps more difficult than... manually annotating functions with an optional level
No mistake
It was intentional because I wasn't sure whether setting 4GB max would fail in some browser, so I wanted to test with a lower limit first
Because the docs aren't clear about whether the full amount is always allocated up front when using shared memory
The cache evict strategy tends to behave like a smart (@sly pecan, e.g. dynamically prioritized based) generational GC, so I'm not sure whether we could also stole some techniques from them and still keep simplicity.
it's true that if we implemented with a "true" generational GC we might get better results? ๐ค
The difference is that we can free some object safely whenever we would like to, because all objects are just for caching computation, so we may not need to make a true generational GC.
Me when the thing to compile my code into pdf contains a generational GC for performance reasons
Typst is absurd in some ways and I love it
just checked the new parallel comemo pr and i dont quite understand what accelerator does. i guess i've missed a lot of disscussion. ๐can anyone tell me about that or improve the docs?
I just wanna bring up this message by @sturdy sequoia #contributors message Running plugins in parallel would be massive for cold compiles on documents that use them extensively
(Say a plugin that adds support for an image format for instance)
it makes validation across multiple instances of the same Tracked<T> faster if I understand it correctly
validation is the act of checking whether two runs have returned the same values, which Typst uses to check whether the doc has converged
"if I understand it correctly" isn't it your PR? ๐
Accelerators arenโt 
They were already there
oh, i thought it was new
Mind you, they improve performance a lot
so clearly they're needed ๐
It's a bit tricky. We can't expect wasm plugins to be compiled with atomics support. While we can fire up a separate instance per thread, that's gonna add overhead and be expensive for plugins that have costly one-time initialization. So it depends a bit on the case whether it's worth it.
I guess testing would be required, but if I had 100 calls to a plugin, each taking 0.1s, I'm guessing it might be worth it to run them in parallel
even with overhead
Perhaps plugins could somehow hint to their parallelism support
Plugins that really don't need it will just not ask for it, and those that do will have to ask for it and guarantee themselves that they won't break
it's true that we could just have a boolean in the plugin constructor
parallel: false by default
I think it could be more refined than that but that's definitely also an option
Also is there any way for plugins to directly generate PDF elements or Typst values?
I figured
other than the minimal api
no and that won't happen soon at the very least
and could become a real mess ๐
fair
yes but stability and the interface would be very hard
i think the way we have it now is a good balance, at least for now
for now yeah, it's a good MVP
adding typst values somehow would be more reasonable, but writing pdf directly would be pretty hard I think
I think a Frame interface would make more sense
where they can pass serialized frame and load them into Typst
And dangerous
indeed
noice
@sturdy sequoia I've reviewed the comemo PR
Yes I just saw, thanks
@left night regarding the last_was_hit feature in comemo, I made it default because tests can't specify their own feature ๐ฆ
Have you made sure that they are #[cfg(feature = "last_was_hit)] alongside being #[test]? It won't help, but atleast you won't get compile errors if the feature isn't specified..
(and frankly, I'd do that and make a test-all feature that can be used in CI / local testing that includes this features, and then remove it from default, just for the cleanness)
but then you dont test...?
I mean cargo test --all-features is definitely there for a reason, especially if your features are all additive, if not, then you'd make a feature say test-all that includes all additive features, and then it's cargo test --features test-all............
well, if that works then sure
they can specify required-features
it's a bit of a pain because you still need to pass them manually
but that's life
@left night I am done with the code review items, I am just testing whether the immutable: HashMap<...> is really necessary ๐
@left night (sorry for double ping) what would you replace it by, the old method?
or just always adding immutable calls? ๐ค
I can also test both
Probably the deduplication is still worth it
But hard to say
Since I don't want to overoptimize it for the Tracer usage, I would probably keep it
@left night As far as I can tell, on my thesis at least the two methods are equivalent, without the hashmap is barely faster (like 10ms on an incremental run of half a second) but it is slower in cold by 0.1s
So they're pretty much equivalent
And I just tested without dedup
and it's a good 30% slower than either methods ๐
I'll keep the "simpler" manual dedup method then ๐
@left night it's pushed
@sturdy sequoia does the non Send + Sync version really need a separate surface?
since the Send + Sync version forces the type to be Send + Sync anyway...
@left night I did it because the TrackedMut as the #ty passed to it:
So I wasn't really sure whether I could merge them

I see. Hmm, the duplication is unfortunate.
I agree, but I wans't sure how to circumvent it
Maybe a better alternative would be to either just require Send + Sync or allow the annotation to be #[comemo::track(unsync)]. in either case, I think we want to generate just one.
since we don't personally have a use case for unsync, we can also just skip it I guess
the return value must also be sync, so it's a bit moot
it's a bit annoying imo to have to write Tracked<dyn Trait + Send + Sync>
What I don't understand is that if trait Trait: Send + Sync { ... }, why do we need to specify it again
?

did you try that?
hum... I don't know ๐
I think so
but I can try again
maybe we don't, from a quick test
if we can just annotate the World trait Send + Sync, that'd certainly be ideal and no changes in comemo would be required (compared to main)
yes, that's what I like
from my quick test it seems to be the case
very nice!
Lemme double check before I make a fool of myself ๐
Tracked<'_, dyn World + Send + Sync> was indeed ugly
well it's not needed somehow
I guess though some of the other cleanups
@left night there are still some changes needed to World regarding interior mutability for World to be truly Send + Sync
(removing OnceCell, RefCell, etc.)
Those are on my branch
It's pushed ๐
And I confirm, typst works just fine by just swapping comemo and modifying the interior mutability (as well as some stuff that returns Self which isn't supported anymore ๐ )
I did some refactoring on the PR. In particular, I replicated the Inner optimization from the mutable constraints to the immutable constraints for less locking, split the thing up into a few more files, and switched to required-features for tests. Feel free to take a look / test it yourself before I merge.
It looks alright to me, just a warning when building in release which I fixed ๐
@left night as a final tally, this provides a 30-35% performance improvement in incremental ๐
I see imcoming commit want to make world Send + Sync thread safe. This may make some world implementations quite pain.
https://github.com/typst/typst/commit/5159c303133282c7d1c6567c1487d4f351d83eff#diff-fb076026c9166e521c4b4c70de26b2ed2c2052cec8e4df9edc5c506616417aaeR187
For example, a world contains JsValue will never be sync by design of wasm_bindgen.
๐ฅบ @sturdy sequoia idea?
I think the easiest in this way will be to have your js handling on a separate thread and use channels to communicate from the world to the js. That will keep world send and sync while not requiring js values to be send and sync
That may increase much cost to communication, but I think that's a general idea to allow non-sync things to be sync
๐ฟ still be pain to be honest, I may use unsafe impl Send before I have time to make a that sync wrapper...
@sturdy sequoia Another question, will multi-threaded comemo become slower if its user doesn't use any parallel thing like rayon? My some private projects (not related to typst) use comemo, but they are all run in single thread as intended.
No itโs in fact significantly faster
Sowwwyyyyyy, but it's needed to make Typst multithreaded ๐
Maybe you could just use a patch in Cargo.toml to force use the old version of comemo (single threaded)?
that's not really sustainable?
no indeed, but as a stopgap it should be fine?
you might want to check out the send_wrapper crate
merged ๐
not sure. some more testing might be sensible.
no worries. it's worth because multithreaded typesetting is definitely so cool! I don't know but I think latex is still single threaded.
LaTeX is extremely single threaded, yes
sounds like a good news. I've forgotten it. I'll give a try.
since we control comemo, I guess we can switch to a git dependency temporarily
I just don't want to use git deps for unreleased things that we don't control because it could mess with our releases
I think we could, the performance bump is worth it overall
most likely single threaded, after all writing multithreaded C code is hard
this is all that's needed, right? https://github.com/typst/typst/tree/parallel-comemo
don't forget to modify the test and bench world otherwise they won't pass CI
I can open a PR from my branch if you want
but CI passed?
I already adjusted the test world here: https://github.com/typst/typst/commit/cf6ce9fd53dae24ec46142e2c9b249cb4ae102b1
which is already on main
because the World impl in the CLI and in the tests use OnceCell from the stdlib
the bench world I didn't touch
ah right, I didn't know that
feel free to make a PR instead of me merging my branch
Then it looks good!
it's your work after all
Gimme five minutes then ๐
please use a git dep with rev instead of a patch though
otherwise typst can't be used as a library properly
It's still pretty broken, but the diff required for parallelity is just laughably small.
that's impressive!
BTW, as a quick update, compared to pre-content rework, we are at around 5x faster incremental and 2.5x faster cold
INSANE
Without parallel typst which brings that even higher
yeah ๐
I think there is still some perf on the table while remaining single threaded, but we've already done so much!
So hum... I have two hours to make slides ๐
oh no, godspeed
not a problem with all those performance improvements ๐
it's for a typst presentation too ๐
you lied 
I tried the recent changes and it's a major improvement again xD
Sowwy I canโt stop myself
Exposed.
So humโฆ I take it youโre happy with parallel comemo then?

yup yup, cold compilation time went from 1,3s to 800ms for my document (4 threads)
Wait, with main?????
yeah
From which version????
Okay, what cpu are you using?
Because the gains should be 30% incremental but pretty much zero cold
should be Intel Core i5-8265U
And is it better incremental ?
wait i'll measure it again
Benchmark 1: typst c document.typ
Time (mean ยฑ ฯ): 1.806 s ยฑ 0.018 s [User: 1.540 s, System: 0.623 s]
Range (min โฆ max): 1.787 s โฆ 1.849 s 10 runs
Benchmark 2: typst-dev c document.typ
Time (mean ยฑ ฯ): 1.028 s ยฑ 0.024 s [User: 1.247 s, System: 0.086 s]
Range (min โฆ max): 1.012 s โฆ 1.095 s 10 runs
Summary
typst-dev c document.typ ran
1.76 ยฑ 0.05 times faster than typst c document.typ
"typst-dev" is 41c0dae2, "typst" is just 0.10.0
that was a different document though (35 pages). I make a lot of use of states and some calculations
Simple edits for incremental:
- typst: 175ms - 240ms
- typst-dev: 150ms - 200ms
idk how I would easily pipe this into hyperfine though^^ just an estimate, so cold compile time definitely improved a lot more
@sturdy sequoia I haven't tried it yet, but do any of the recent developments also make wasm faster? I remember it being very slow, especially for cold compilations when it first came out
@left night is working on using wasmtime as the executor which is significantly faster
is there a branch to try?
I mean using it in the compiler is easy but integrating it into the web app is hard
Yes, in the cli itโs trivial, the problem only shows itself for the web app
I haven't really continued on that for now because of the wasm-bindgen Sync problem.
Well you could have the plugins running in a worker and use a channel to send messages to it
Should be fairly easy
Famous last wordsโฆ
That would still need world integration
Which I was hoping to avoid
It's also a bit unfortunate to unconditionally spawn evermore workers just in case although that will need to happen for rayon to work anyway
In short: I originally thought I could quickly use the browser wasm and just ship it in a day, but since it's more complex, I shifted it back in my priorities.
The web app just uses the wasm engine of the browser doesn't it?
No right now it uses wasmi
Which is why itโs so sloooooooow
I just assumed it would use the browser
I think the web app itself runs in the browser
But wasm plugins in the web app use wasmi
Node.js != the browser. I've learned this, is this true
Node.js has a ton of specialty APIs and misses tons of web-specific APIs
i found this. maybe it can help typst.ts and (webapp?)
A wrapper type that allows you to send !Send types to different threads.
We are speed
@left night Can you explain to me why you don't want the plugin interface to be part of World? Just curious as to what's the rationale here, it could be just a Sender<PluginCall> or something like that
(a) I always want to avoid complexity. In this case maybe it is necessary complexity, but it is still complexity.
(b) It is one more thing that world implementors have to worry about, at least unless it is implemented by default through a feature flag.
@low sapphire Since you like free performance: https://github.com/typst/typst/pull/2989
Nice!
why switch from std OnceLock to once_cell?
We're already using it elsewhere and it doesn't require unwrapping
that's it really
You should put "I made Typst N times faster" or something on your CV 
OnceLock doesn't require unwrapping?
you probably mean parking_lot vs std
No it does because by default it returns an Option<&T>
or am I confused? 
Ah no, the big difference is that OnceCell has the .wait() method
whoopsy
yeah, but for the CLI world that's not needed
Regarding Hash and PartialEq
Does it really happen only later for the pattern or immediately?
hmm, probably immediately so I could probably just .wait() it
in my initial work that's what I had done
and the deferred only happens in the first place because that's what construct_page happens to do?
yes
waiting directly seems cleaner
Done ๐
@left night do we have any CI for the CLI as a whole? Like testing the CLI itself not just the "engine" so-to-speak
No
Dherse has a github sponsor page up. We are three people that sponsor him.
You could join us, if you want.. ๐
Also there is sponsoring typst itself. ๐คทโโ๏ธ
@sturdy sequoia doesn't have to be part of this PR, but parallel cmap creation and font subsetting is probably also relatively low-hanging fruit.
I'll have a quick look ๐
Where should I look? ๐ค
create_cmap I suppose?
Just the subsetting would be simplest, but the cmap creation mutably borrows the glyph_set that's used by the subsetting, so doing cmap -> subsetting sequentially per font, but for all fonts at the same time, seems reasonable
This line might be a problem: https://github.com/typst/typst/blob/main/crates/typst-pdf/src/font.rs#L34
I don't think the glyph_set is used after that loop though, so it could probably be taken
@left night I doubt I did it how you wanted so here's what I did:
- First: I prepare the items by allocating all of the refs etc. on a single thread
- Second: I use a par bridge to go multithreaded where I do all of the font stuff
- Third: I finalize on a single thread to write to the PDF
does that makes sense or is it wayyyy too overengineered?
Is this description for the fonts or for the whole export?
of the fonts part
Why do you need to allocate the refs in the first part?
let processed_fonts = ctx.font_map
.items()
.map(|font| {
prepare_font(&mut ctx.alloc, &mut ctx.font_refs, &mut ctx.glyph_sets, font)
})
.par_bridge()
.map(process_font)
.collect::<Vec<_>>();
for processed in processed_fonts {
finalize(&mut ctx.pdf, processed);
}
This is essentially what happens
I had thought to just prepare the cmap and subset font and then merge First + Third
that also works indeed
maybe we should off for now
and instead design a proper architecture for the whole export
where we (likely) have indeed the three phases you described
but for everything
Okay, so I don't push the font changes
On my thesis, I cannot measure the change either way ๐
But I guess this would be more impactful on CJK stuff
that's curious, I would have thought that subsetting is quite expensive
but for CFF likely indeed much more than for TrueType
I mean overall PDF export isn't that slow
image encoding was extremely slow
but since it's been multithreaded it's very decent
in the naive parallel test I did, it was quite slow
but probably just because the test was naive and everything else was very fast
30000 page document or sth
yeah okay ๐
I mean right now it's only parallel across fonts, right?
not within a single font
so if a single font is slow it won't change much
yeah
the $10 option sounds tempting xD
Please feel free too ๐
doesn't seem to be faster though? for cold compilations at least
depends on the length of your doc!
I tried the same from last time
seems to be slightly slower on my hardware, but only by a few milliseconds
but if it speeds it up for other docs, I guess that's fine
How long is your doc and how many cores do you have?
35 pages, 4 cores
I just run a test with hyperfine --warmup 3 -m 100 ...
your PR is ~10ms slower on my laptop
for that payload
full of text, at least 30 images, couple of tables etc
it's slightly faster on the same template I used, but without any "content"
which has 5 pages
do I need to change anything to enable it? because I'm seeing no change at all really
I only checked out his branch and compiled it
the documents I've tested it on might have other bottlenecks
that being said, I only tried pdf
yeah svg is a bit faster
1.011 s vs 989.5 ms
might be even better on newer CPUs, idk
@sturdy sequoia by the way I'm seeing a performance increase with target-cpu=x86-64-v3
it's small, but there
like <1%
granted my testing was very haphazard
compared to what? ๐ค
The default is x86-64, which excludes many instructions
x86-64-v3 does however exclude CPUs older than approximately 10 years
Right, I had tried target-cpu=native but I didn't see many gains
x86-64-v3 should be similar to native
figures
Anyway, it's free real estate
but it breaks some compatibility
which I'm sure Laurenz won't like
(not that I care, if your CPU doesn't have AVX2 just throw it in the trash)
It would just be an additional target. Thats how mpv does it. You can choose to get v3
(although I can't imagine that many people are using pre-haswell computers today)
In developing countries? probably loads
Hey I have a librebooted x200 at home :D
Penryn is the code name of a processor from Intel that is sold in varying configurations as Core 2 Solo, Core 2 Duo, Core 2 Quad, Pentium and Celeron.
During development, Penryn was the Intel code name for the 2007/2008 "Tick" of Intel's Tick-Tock cycle which shrunk Merom to 45 nanometers as CPUID model 23. The term "Penryn" is sometimes used to...
๐ Naive question. What are you optimizing? Typst is freakingly fast for me.
Well, everything?
Is there anything that is slow in particular?
Several things can be faster, on large documents it can take quite a bit of time
I should also mention that you must think of it this way: every optimisation has a big impact on people on low end devices since theyโre constrained by the performance of their hardware
I think before dherse started, there were a few issues... Thesis long documents were quite slow to compile.
And it was too slow frankly.
Plus I've tried to start a paper with a long spurious png files in it, and performance suffered quite fast.
Indeed, I had performance issues before 0.8.0 and 0.9.0 release. But now most of it is gone. I wonder what is left to do. If png files are big, I wouldn't expect the compilation / rendering be fast anyway.
Yes, that was the content rework, but now we're mostly working on make Typst more parallel
Isn't it more like... You're working on making the rest of typst parallel?
Mostly laurenz
I made comemo parallel, but he's making the engine parallel
I have a long document that I converted to typst, that takes 18secs for typst and 12secs in latex on my old laptop on a clean first build so there is still some optimization potential
that seems like another good benchmarking candidate if you can share it or remove sensitive information
and maybe theres even some overuse of state and locate one can get rid of
If you could share it it would probably help a lot!
18s in 0.10?
Yes, but it is much faster on my newer system. It was more of a relative comparison to the latex times.
For sharing I would need either a more private setting or some cleanup time since we have solutions included. (which get integrated depending on a boolean flag)
When you say long document, do you mean one that is mostly text but thousands of pages?
No, there are also quite a few tables and graphics (some of which we converted from pdf to svg). The document is only ~100 pages long.
18 seconds sounds too slow for that, unless this is on a potato
Anyway, latex can be quite fast on "simple" documents (i.e. no fancy packages etc). Probably faster than typst cold compile can hope to be (at least single threaded)
why can't we hope to be faster?
because of more primitive text shaping in pdfLaTeX?
I mean, can't is a strong word, but tex is very well optimized. It's only when you start introducing "complicated" things that it slows down to a crawl.
That too
And typst presumably introduces extra overhead in order to be able to incrementally compile
Yes, that's true, although I don't think it's that much
I haven't tried with all the performance improvements, but pdftex was definitely faster on just pure text
Possibly even luatex, but I can't recall
Should try again some time
The text layout itself hasn't really be optimized yet in Typst
I mean some stuff like caching is_cjk etc. have helped a ton, but it's clear that there's room in there
https://github.com/frozolotl/typst-mutilate
Hope this could help
@sturdy sequoia when I upgrade comemo, I find this function is completely broken. ๐ซ Now I want to have a thread local comemo and a threaded comemo in same program.
I think it would be nice if we actually had that like #[comemo::memoize(local)] or something like that
Why does it break BTW?
the return type is not send
*It was ever mentioned in this forge. But I didn't replace comemo and get errors on comemo::memorize macros.
btw, Is comemo::evict weird if there are multiple threads that run different incremental tasks?
In typst preview, there is a incremental compile thread, which evicts comemo cache after each compilation, and there is a incremental export thread, which evicts comemo cache after each render task.
comemo evict evicts all caches, everything will still work, but it won't benefit from caching or acceleration
They doesn't seem to find a suitable timing to evict cache to both computation.
we really need a per-thread cache too then ๐
you should be able to evict both caches in their own time, the problem is that one will evict the current cache of the other thread
all of the Tracked<T> will still work, they'll just be slower
you are right, comemo is safe to evict at any time, but that looks a bit weird.
I admit it looks a bit weird, but I designed it (with Laurenz) specifically such that you can call it anytime and everything would still work
Essentially since function only memoize once they reach the output, evicting doesn't affect them.
and for accelerators in tracked structs, it just disables accelerations for outdated tracked
this is because it is a fairly dumb generational GC with a max generation of 1 for accelerators
So how can I go it on ๐.
I wonder if we could have a #[comemo::memoize(local)] too, it shouldn't be too hard
I'll see if I can do that ๐
So we will have an evict and an evict_local?
yes that's how I am designing it
@sturdy sequoia If we borrow design ideas from previous work, to make threaded comemo really match rust's thread model. I think tokio's runtime handler design is nice to check. they store the a lightweight metadata in thread local. Then, a thread function can easily creates an isolated async handler, enters that handler, or exits that handler. The handler is also sendable among threads so that they can group threads into different async isolates.
Similarly, comemo can have a Handler to create. And we group threads and caches by handlers.
@untold turret
that could be an idea, but it feels overly complicated for right now ๐
Oh that's a bit yucky.
๐ฑ having local is great enough. In preview scenario, A single compiler uses evict and rest render tasks uses evict_local.
Can't you just &'static mut ()
The only problem @untold turret is that you cannot pass Tracked into locally memoized functions ๐
But lsp may have multiple compilers.
evict and evict_local are completely separate
so they won't ever interfere
get it.
that's because I know raw pointers are never Send afaik
is that a problem?
Not a problem till now.
Oh right mut pointers are send if their payload is sync, which () is probably
let me check that
I think they are never send.
mut references can be send, mut pointers never are
yup seems so
I guess because mut pointers are Copy
Sad that we can't just
struct NonSend;
impl !Send for NonSend {}
in stable yet.
yeah you need to use phantom data with pointer or w/e
I guess you could &() as &dyn Any
Or you'd probably need like Eq or something right?
@untold turret I did find a way of passing Tracked inside of a local memoized function!
Well it might look something like this
#[test]
fn test_unsend() {
trait OpaqueNonSend: PartialEq<()> + Debug {}
impl OpaqueNonSend for () {}
#[comemo::memoize(local)]
fn add(_a: u32, _b: u32) -> &'static dyn OpaqueNonSend {
&()
}
test!(miss: add(1, 2), &());
test!(hit: add(1, 2), &());
test!(miss: add(2, 3), &());
test!(hit: add(2, 3), &());
}
I don't know if this is really better with all the trait mess
I mean it's just a test for CI at the end of the day it doens't matter too much :-p
how do you do that?
#[comemo::memoize(local)]
fn dump<'local>(mut sink: TrackedMut<'local, Emitter>) {
sink.emit("a");
sink.emit("b");
let c = sink.len_or_ten().to_string();
sink.emit(&c);
}
You must manually annotate the lifetimes specifically with the name 'local!
Internally it uses substitutes the 'local for 'static
it's a bit ugly but it works ๐
len_or_ten?
it'a a thing from the tests nevermind that
oh ok
@untold turret https://github.com/typst/comemo/pull/6
You mean we cannot elide a '_ lifetime?
no :/
internally it does the following:
type __ARGS<'local> = <::comemo::internal::Args<(TrackedMut<'local,Emitter> ,)> as ::comemo::internal::Input>::Constraint;
It's a limitation of how thread_local in std resolves lifetimes, it's weird because it should work but it doesn't...
Probably running in a weird edge case in rustc
thread local is undoubtly a edge case. We have... many crates to loose our coding live.
the thing is that in the non-local version, we don't need to annotate lifetime and it still resolves just fine
but in the local case, where it goes through the thread_local macro we do
so it's really dumb
@glad urchin in tablex, there is this function:
// Measure a length in pt by drawing a line and using the measure() function.
// This function will work for negative lengths as well.
//
// Note that for ratios, the measurement will be 0pt due to limitations of
// the "draw and measure" technique (wrapping the line in a box still returns 0pt;
// not sure if there is any viable way to measure a ratio). This also affects
// relative lengths โ this function will only be able to measure the length component.
//
// styles: from style()
#let measure-pt(len, styles) = {
let measured-pt = measure(box(width: len), styles).width
// If the measured length is positive, `len` must have overall been positive.
// There's nothing else to be done, so return the measured length.
if measured-pt > 0pt {
return measured-pt
}
// If we've reached this point, the previously measured length must have been `0pt`
// (drawing a line with a negative length will draw nothing, so measuring it will return `0pt`).
// Hence, `len` must either be `0pt` or negative.
// We multiply `len` by -1 to get a positive length, draw a line and measure it, then negate
// the measured length. This nicely handles the `0pt` case as well.
measured-pt = -measure(box(width: -len), styles).width
return measured-pt
}
Could that not just be using the calc.sign and calc.abs to measure only once? I'm asking because the measure calls really add up
Maybe that would be better handled if we had a len.resolve(styles) method, that way you could obtain an absolute size from a relative one (i.e ems)
@left night Do you think we could have this: length.resolve(styles) it would also take the styles argument (just like measure) but work directly with a length as a much cheaper alternative to relying on measure here:
/// Resolve this length to an absolute length.
#[func]
pub fn resolve(
&self,
/// The styles with which to measure the length.
styles: Styles,
) -> Length {
let styles = StyleChain::new(&styles);
Length { abs: self.abs + self.em.resolve(styles), em: Em::zero() }
}
It's a dead simple function too
You could do measure(stack(dir:ttb, box(width: len), box(width: -len))).width
That would give the absolute value of the length I believe
But I haven't actually tried...
@sturdy sequoia
That's a pretty good idea!
Still I think having a resolve function makes a lot of sense @sly pecan
Sure
Although I get with the whole context thing we might have it be easier
It was proposed before. Maybe there's no harm in adding it now. I just don't want to add more and more things to the old system before doing contexts.
But since it will all be breaking anyway, there's probably not much harm in it.
Well, in the PR that added this, I wondered if this could have been a problem for performance, but also we didnโt really know how to handle it
Also the second measure technically only happens with 0pt measurements anyway
Well, <= 0pt
So it should be relatively rare
But still thereโs no way to convert em to pt otherwise atm
It could be made a bit more efficient by just checking em and doing sign and stuff yeah, but Iโd have to restrict that to Typst 0.11.0 since thatโs what we can currently โdetectโ inside tablex code
Either way we can discuss this more later but the main problems are the technical limitation of using measure and version compatibility between different approaches
makes sense yeah
btw #1175895972090499112 message

calc.sign doesnt exist
;-;
Waaaaaaaaaaaa
i think you replied to the wrong message lol
I did
anyway np i made my own sign function
it's glorious
(from stackoverflow of course :D)
Based
@left night do you have any opinions on creating a built-in calc.sign? ๐
No opinion
oh
I'm not a huge fan of calc overall though
I wish it were methods on a number type
That can also be called as num.func
honestly, i have thought the same thing multiple times lol
but ig we'd still have to discuss that and consider all viewpoints and talk to the community and laziness is superior
๐
maybe such a larger-scale change can come with the type rework in the future
having both (int/float).sign sounds great
I'm not sure whether this would make sense with distinct int and float types though
well, we'd probably have to implement the same methods on both somehow
but personally I'm in favor of keeping distinct types, using int at least you have no chance of getting precision errors or stuff like that. seems fair enough to me
not to mention that there are indeed situations where the distinction is important (e.g. some-array.at(5.5) would error)
but thats a bit off-topic for this thread so we can probably discuss this further elsewhere if needed
:p
well, it could error still
yeah, that of course. but if you want to write a generic function you'd need to pick one (or use method notation which is sometimes ugly)
and e.g. float.sqrt would mostly work, but float.min would yield a different type
maybe it's not a big problem
well yea, float.min would have to return some enum
i guess in the end there is some value to calc for those multi-type things. cuz I mean, .min could also make sense for length types for example
so it's a bit hard to decide on the best way to proceed here
also, btw : https://github.com/PgBiel/typst-tablex/pull/106 ๐
Analysis of Arc vs String cloning performance in contentended scenarios.
Adds the .resolve(styles) method to length to allow retrieving a length as absolute (without the em part) when in the right context.
This changes fixes packages like tablex (https://github.com/PgBi...
Very epic
Thanks ๐
I think length.resolve(styles) would probably just become length.resolve() or similar with the context proposal so that's cool
yes, I think the change would be relatively small once we have contexts
@sturdy sequoia when I upgrade to new comemo, the performance degrades a lot on small documents.
// baseline (before)
https://github.com/Myriad-Dreamin/typst.ts/commit/71fccd6814be2a1daeea01f23b2567c9fbf5a484
typst_ts_bench_lowering fastest โ slowest โ median โ mean โ samples โ iters
โโ lower_cached 293.8 ยตs โ 1.493 ms โ 339.6 ยตs โ 395.9 ยตs โ 100 โ 100
โโ lower_incr 10.73 ms โ 17.19 ms โ 11.66 ms โ 11.89 ms โ 100 โ 100
โโ lower_the_thesis 128.6 ms โ 280.2 ms โ 143.8 ms โ 152 ms โ 100 โ 100
โฐโ lower_uncached 1.097 ms โ 2.142 ms โ 1.214 ms โ 1.263 ms โ 100 โ 100
// with threaded comemo (after)
https://github.com/Myriad-Dreamin/typst.ts/commit/52ae23bd8003c73650956148c7a12a29f32f9fc9
typst_ts_bench_lowering fastest โ slowest โ median โ mean โ samples โ iters
โโ lower_cached 326.9 ยตs โ 3.153 ms โ 419.3 ยตs โ 460.9 ยตs โ 100 โ 100
โโ lower_incr 23.93 ms โ 33.19 ms โ 25.98 ms โ 26.56 ms โ 100 โ 100
โโ lower_the_thesis 127.7 ms โ 149 ms โ 134.9 ms โ 135.6 ms โ 100 โ 100
โฐโ lower_uncached 9.389 ms โ 18.36 ms โ 10.07 ms โ 10.48 ms โ 100 โ 100
Among 93 seconds bench time, there are 60 seconds of time spent on mutex/atomics.
๐จ all top 10 time-consumed functions are mutex.
But I'm tired on editing thousands of lines of code broken in lastest typst, I may take a rest and come back to explore more how to rescue the performance degrading.
what platform are you running on?
I'm on a laptop windows amd64
Some degradation was to be expected, but that much contention isn't normal ๐ค
There just isn't anything multithreaded enough
what export target?
My svg export tho
I have tried to add rayon to make it parallelized in page level or each frame. the result from vtune shows that it doesn't help
But I should write a cache manually to avoid using threaded comemo for my export with latest typst, to check which parts of code cause performance degrading in end
What's going on with performance stuff rn?
nothing ๐
I have been on a bit of a break
Ah
Is it crazy that changing a SINGLE .unwrap() call to an unwrap_unchecked call saves 6% of the total execution time of typst????
Daring today, aren't we
๐
which check did you skiped?
Then you're bypassing safety aren't you?
I don't know rust, but unchecked sounds dangerous!
Well to be fair, we check the condition right before calling unwrap
means its some path where the compiler cant statically analyze or optimize the unwrap i suppose
fn make_mut(&mut self) -> &mut dyn NativeElement {
let arc = &mut self.0;
if Arc::strong_count(arc) > 1 || Arc::weak_count(arc) > 0 {
*arc = arc.dyn_clone();
}
unsafe {
// Safety: We ensured the content is not shared.
Arc::get_mut(arc).unwrap_unchecked()
}
}
what happens if the condition fails?
we clone the arc meaning we create a known-unique arc
so no matter what, we know that we can Arc::get_mut safely ๐ค
dumb question: can there be some race condition where it changes from another thread between the check and the unwrap?
probably not? ๐ค
I mean I don't want to introduce unsafe
I just thought it was funny
further dumb question: can you unwrap without the check beforehand, and instead possibly handle the error after instead?
if unwrap does the check anyway
no there's some borrow checker limitations ๐ฆ
shouldn't the compiler be smart enough to optimize away the check if it's possible?
since there's atomic operations involves, it's unlikely
I don't know, I'm just struggling to find new avenues for optimization ๐ฆ
Maybe it's mostly fast enough? ๐
for another thread to do stuff, it needs to have a ref to the arc itself. in that case the strong count would be > 1. so after the first part of the if condition, it's clear that we are the only one with a strong ref. then we check for weak refs (a weak ref can't introduce a strong one) and there are also none. I'd have to ponder a bit more to be really sure with the weak refs, but generally this kind of check is safe.
and for 6% perf across the board, it would honestly be worth it. for 6% in some specific doc, maybe not.
Actually no, because it could also upgrade a weak reference
but it could happen that a weak reference gets upgraded between the time you checked the strong ref count and the weak
that's why Arc::is_unique does some magic stuff to prevent this scenario, unfortunately it's a private method
ah okay, yeah as said with weak I'm not 100% sure how it works
but with just strong, the reasoning works
I found it because I wanted to see what was the hit rate of Content::get_mut, and whether it needed to clone all of the time, which it does clonse in about 35% of the time (on my thesis) and clones 55% of the time in incremental
Because I feel that cloning (and the many atomic operations that come with it) is likely still, I believe, a bottleneck
It clones so much because it sets the guard field
yes
I am (right now) working on extracting the guard, label, location etc. from the individual elements into a generic Packed<T>
This is necessary to be able to merge Content and Value
Which is why I was thinking that if we could extract the guards it could work out really well
Why tho? ๐ค
Because e.g. a Color that ends up in the doc also needs to be able to have guards, labels, locations
Ideally I think Content should generally be immutable with the exception of synthesized fields
ah right
I'm making good progress on this and will likely open a PR early next week
That's exciting ๐ฅ
Wow nice ๐ฅ
I wonder how it will improve perfs ๐ค
Or degrade them 
For now, I've kept everything in a single Arc (basically Content(Arc<Inner<dyn NativeElement>>) where Inner holds the stuff. I just want to focus on extracting it and keeping all tests green. The, once Value and Content are merged, I want to optimize the representation a bit.
E.g. an int shouldn't be stored as Arc<dyn NativeType>. That's horrible.
Ok, I think that's a good first step
(The built-in enum will go out the window.)
Not the built-in enum! I rely on it for typstfmt! /Jk
And I want to make Packed<T> coercable into Shared<T> (just a shared ref without the metadata). Then we can use Shared<Gradient> in the output and don't have to deal with Arcs internally for every type to keep it small and cheap to clone.
Shared might just be Arc. Not sure. Perhaps also a new nice type that saves a lazily initialized hash alongside the ref count. That'd be neat.
The Value enum?
/joke
Maybe there is a way where this could use a cheaper non-weak supporting Arc?
like the one from triomphe?
If it would store a hash, it would need to be hand-rolled anyway.
Probably at least
I would probably also write it in a way where the Arc pointer points directly to the data and the header is in front of it. That way Deref is a no-op.
EcoVec does that
oh damn then you could do stuff like configuring how floats are formatted?
yes, after content and value are merged
once there are get rules you could then do something like get the current language and format the float depending on that right?
yes
awesome ๐
as long as you don't format the float into a string eagerly
it basically needs to stay a float and end up in the content tree
hmm how would you format it without turning it into a string though?
In the float show rule you can turn it into a string. But if you do [#str(5.0)] then that string conversion can't be language aware and the float show rule will never run.
ahh gotcha!
absolutely, yes
@left night I've done some preliminary testing on your PR and here are some notes
- While compiles (cold & hot) are pretty unaffected (~2% slower),
- Separate arcs for the metadata and dyn trait does give some gains, especially in incremental
- Using a box is worse
- inlining the value all into the
Valueis actually not nearly as bad as one might expect, being even faster in some incremental tests!
- Overall, I like the change, and I think it opens the door to better performance and API (Value = Content rework) in the future
What I am mostly curious is if we can use this to make Content<T> where T defaults to dyn Bounds such that a lot of elements could instead of having a Content, or a RawLine could have a Content<RawLine> which would avoid needing to reallocate.
I also think we could try to have the dyn Bounds be Prehashed to gain a bit of performance and remove all other prehashed
Could you clarify what you mean with that?
This one I also don't really understand. Isn't that more or less Packed<T>?
no my idea is to keep the Arc but specialize it (which you can do without re-allocating it) to avoid doing cloning in a few elements
like having one bigger struct where all of the elements are in the stack and only the dyn bounds is in an Arc
ah, basically making Value huge?
yep
but Packed<T> also doesn't reallocate?
OOOOOOHHHHHHHHHHHHHHHHHHHH
Right
you're right
It is basically exactly what you describe, just layout-compatible with Content
Which is nice because we can reinterpret a &Content as a &Packed<T>
I guess Content<T> might also have worked instead of Packed<T> with #[repr(transparent)] but what I like is that it doesn't depend on any Rust type system coercion foo.
It leaves flexibility for how things are implemented.
Yes you're right
I have a half-finished blog post with a bunch of ideas on how to implement an efficient Value encoding that also supports user-defined types. Unfortunately, it is half finished :(
