#Performance


left night
#

I can imagine very well

#

Pretty similar to pitching a startup to investors :)

#

I really think it is. And it's totally awesome that it's almost trivial due to purity.

#

Try doing that with LaTeX!

sturdy sequoia
#

And rust really does make that a breeze imo

#

just the Deferred abstraction being this easy kinda blows my mind

#

fearless concurrency for sure!

#

@left night I can already tell you: deferred compression makes a big difference

#

I'm not done testing but it looks very promising
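
For readers who have not seen the idea: a minimal sketch of what a `Deferred<T>` could look like, assuming plain `std` threads. The type, its fields, and `wait` are hypothetical names for illustration, not Typst's actual implementation. The work starts in the background on construction and the caller only blocks when it needs the value.

```rust
use std::sync::{Mutex, OnceLock};
use std::thread::{self, JoinHandle};

/// Hypothetical deferred value: computed on a background thread,
/// retrieved (and cached) on the first call to `wait`.
pub struct Deferred<T> {
    handle: Mutex<Option<JoinHandle<T>>>,
    value: OnceLock<T>,
}

impl<T: Send + 'static> Deferred<T> {
    /// Start computing `f` right away on another thread.
    pub fn new<F>(f: F) -> Self
    where
        F: FnOnce() -> T + Send + 'static,
    {
        Self {
            handle: Mutex::new(Some(thread::spawn(f))),
            value: OnceLock::new(),
        }
    }

    /// Block until the value is ready; later calls return the cached value.
    pub fn wait(&self) -> &T {
        self.value.get_or_init(|| {
            let handle = self
                .handle
                .lock()
                .unwrap()
                .take()
                .expect("worker already joined");
            handle.join().expect("deferred computation panicked")
        })
    }
}
```

With something like this, a compression job per PDF stream could be kicked off early and only awaited when the stream is written.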

sturdy sequoia
#

@left night do you know what might cause comemo to randomly miss caches that should hit? 🤔

sturdy sequoia
left night
#

no I mean in which memoized function specifically

#

or do you mean everywhere?

sturdy sequoia
#

in comemo's tests for example

left night
#

and this happens just with your changes you mean? or do you mean that you found a bug in comemo?

left night
#

then, it's really hard to say

#

anything could have that effect

#

wrong constraints for instance

sturdy sequoia
#

@left night pretty sure I know why, I'm using a HashMap instead of an IndexMap so ordering is random 🤦‍♂️

#

Well no

#

goddamn

sturdy sequoia
#

Ok, I determined that it's an issue with the accelerator 💀

#

No, it just happens less without the accelerator

#

GODDAMN BUG

#

HEISENBUGS ARE THE WORST

glad urchin
sturdy sequoia
sturdy sequoia
#

Ok, the plot thickens: if I run only one test, it never fails (I literally wrote a bash script that runs the test until it fails)

#

But if I run multiple tests then one of them sometimes fails

#

weird

#

very weird

#

Ok, it's eviction that seems to cause issues

#

OF COURSE

#

GOD DAMNIT

#

Tests are run in parallel

#

so it can evict while another test is running

#

😂

#

Since it's not thread local anymore

#

Bingo, that was it

#

GODDAMN that was an annoying bug to track
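
A minimal sketch of the failure mode described above, assuming a hypothetical process-wide `CACHE` (this is not comemo's real data structure). Once the cache stops being thread-local, an `evict` issued by one test thread can turn an expected hit in another test into a miss.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Hypothetical global cache shared by every thread in the process.
static CACHE: OnceLock<Mutex<HashMap<u64, u64>>> = OnceLock::new();

fn cache() -> &'static Mutex<HashMap<u64, u64>> {
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Returns the result and whether it came from the cache.
fn memoized_square(x: u64) -> (u64, bool) {
    let mut map = cache().lock().unwrap();
    if let Some(&v) = map.get(&x) {
        return (v, true);
    }
    let v = x * x;
    map.insert(x, v);
    (v, false)
}

/// Clears the cache for *all* threads, not just the caller.
fn evict() {
    cache().lock().unwrap().clear();
}
```

A test that asserts "the second call was a hit" is only reliable if no other test can call `evict` in between, for example by running the tests single-threaded or by serializing them around the cache.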

left night
#

makes sense

sturdy sequoia
#

@left night could you test your parallel layout with this version of comemo please 🥰

#

I wonder if it's any faster

#

To get back to just Deferred for the PDF stream compression: the gains are... meh. Cold they're pretty good, around 6%, but incremental they're basically non-existent because the deflated streams are already mostly memoized anyway 😦

#

Mind you larger docs will see larger gains

#

(can you send me that lorem test doc btw?)

left night
feral imp
#

@sturdy sequoia can you roughly guess what the performance benefit of DashMap would be on native?

left night
#

this is with the threaded branch of the comemo repository

#

this is with your pull request (+ one line addition to the macro to make it compile. it needs Send + Sync on the trait impl).

#

One difference I'm seeing is that my threaded branch still uses DashMap because I wrote it before I knew about the Safari issue.

sturdy sequoia
#

I'm guessing the contention on the locks is very heavy

#

I'll test it locally and see with a DashMap 'cause I'm pretty sure it will be a fair bit faster

left night
sturdy sequoia
left night
sturdy sequoia
#

It's late

#

@left night just so you know, in my slowest incremental test it divides (with my comemo) incremental time by four

#

@left night you're also using parking_lot whereas I'm still using std locks

#

I still think parking_lot is better because they don't poison!

#

So, your version wins on cold compiles but gets absolutely demolished on incremental thanks to all of my other optimizations. On my end you're slightly faster cold, by about 5%, but incremental your version of comemo is usually ~25% slower than mine.

#

pardon the mouse handwriting 😂

#

😎

#

Mind you, both of these destroy main: my version of comemo with your parallel code leads to 66% faster cold, and between 60 and 242% faster incremental

#

parking_lot on my version makes no difference outside of removing all of the ugly .unwrap() calls

#

And actually, it makes sense that your version is faster in cold: in incremental there is less contention over the locks, and while my version suffers from lock contention, yours does not (or not as much).

left night
#

did you try yours with DashMap?

sturdy sequoia
#

I also think that by not having functions race, we might get better performance (i.e by having a built-in Deferred<T>)

sturdy sequoia
#

I am doing the change now

#

but it's quite in-depth since I have RwLock everywhere

left night
#

compile e.g. has no non-tracked argument, so it would allow one compilation at once

#

which is maybe not a problem for our use case but strange conceptually

sturdy sequoia
#

ah, you might be right indeed, my idea was to push it early to the cache and have the cache store a Deferred<T> essentially as the output

#

Just making the ACCELERATOR use a DashMap bumps performance on-par with yours in cold

#

It does make incremental compilation slightly slower

#

so it's kind of a tradeoff between cold & hot compile times

#

mind you I'm pretty sure I could optimize it to handle contention a bit better maybe

#

It's insane seeing my thesis mostly compile in sub 4s 😂

left night
#

The accelerator is perhaps also conceptually not ideal the way it is now

sturdy sequoia
#

it used to take 68s with the old glossary 😱

sturdy sequoia
left night
#

There is a lot of unnecessary contention because it's just a big HashMap that acts like a bunch of small hashmaps.

sturdy sequoia
#

by that I mean split the little hashmaps

left night
#

When a value starts being tracked, it gets a new globally unique ID. And when something is validated on it, this validation is cached in the accelerator.

#

So conceptually, each Tracked owns a hash map in there

#

But I didn't want it to literally own because Tracked should remain Copy

sturdy sequoia
#

Ok, why not move it into the Tracked then?

#

ah okay

#

I sent my message as you sent yours

sturdy sequoia
#

it stays copy and on evict we clear that list of accelerators

#

it's a bit weird ngl but it kinda solves the issue of contention on that big hashmap

#

or we accept that Tracked is Clone

left night
#

since the IDs are ever-growing during execution, it would need to be a global list

left night
sturdy sequoia
left night
#

It would rather have to be handled through a reference

sturdy sequoia
left night
#

I don't like it

sturdy sequoia
#

or something like that 😂

left night
#

Effectively, the accelerator is a leaky bump allocator with manual garbage collection and no reference tracking

sturdy sequoia
left night
#

currently, we "allocate" a hashmap in the accelerator by using a specific ID, we only use increasing IDs and just collect during eviction. since it's only for perf, it doesn't matter if we have "use-after-free" (a tracked instance outlives evict) because it's just a validation miss.

#

bump allocator is maybe the wrong term, but for the IDs basically
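
A rough sketch of the accelerator as described above, assuming a single global map keyed by `(id, hash)`. All names (`fresh_id`, `validate`, `ACCELERATOR`) are hypothetical and the real comemo code may differ. Each tracked value gets a never-reused ID, validation results are cached under that ID, and `evict` simply clears everything, so a tracked value that outlives `evict` only pays a validation miss.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

// Ever-increasing ID counter: IDs are never reused during execution.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

// One big map that conceptually holds one small map per tracked value.
static ACCELERATOR: OnceLock<Mutex<HashMap<(usize, u64), bool>>> = OnceLock::new();

fn accelerator() -> &'static Mutex<HashMap<(usize, u64), bool>> {
    ACCELERATOR.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Called when a value starts being tracked.
fn fresh_id() -> usize {
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}

/// Validate a constraint (identified by its hash) against tracked value `id`,
/// reusing a cached result when there is one.
fn validate(id: usize, constraint_hash: u64, slow_check: impl FnOnce() -> bool) -> bool {
    let key = (id, constraint_hash);
    if let Some(&cached) = accelerator().lock().unwrap().get(&key) {
        return cached;
    }
    // The lock is released here, so the slow check runs without holding it.
    let result = slow_check();
    accelerator().lock().unwrap().insert(key, result);
    result
}

/// "Garbage collection": forget every cached validation.
fn evict() {
    accelerator().lock().unwrap().clear();
}
```

Every thread goes through the same mutex here, which is the contention being discussed.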

sturdy sequoia
#

Yes in my version, it would panic in a "use-after-evict" situation

#

(using the ID to reference and keep Copy)

#

BTW if that can reassure you, with your parallel version of Typst I can't really see the difference in the PDF, so good job 👍

left night
left night
#

lucky that your thesis apparently doesn't really need introspector disambiguation

sturdy sequoia
left night
#

if you have lots of manually written stuff it is needed far less than when generating stuff

#

because the spans are part of the core hash that could be duplicate and thus needs disambiguation

sturdy sequoia
#

I mean there are definitely a lot of quirks like some figures are randomly numbered, but overall it's hard to tell 😂

left night
#

yeah, that's about what I would expect

left night
sturdy sequoia
#

where we have this big vec pointing to individually locked maps

left night
#

and you want to reset the ID in evict instead of ever-increasing it?

sturdy sequoia
#

or there is the option of making Tracked Clone and using the Arc<Mutex<HashMap<>>> 💀

left night
sturdy sequoia
left night
#

trackeds from before evict would then not profit from acceleration

#

because if id < head -> None

sturdy sequoia
left night
#

if id >= head -> list[id - head]
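
The two rules above (`id < head -> None`, `id >= head -> list[id - head]`) can be sketched as follows. This is a single-threaded illustration with hypothetical names, leaving out all locking: IDs keep growing, `evict` moves `head` forward, and anything tracked before the eviction simply gets no accelerator.

```rust
use std::collections::HashMap;

/// Hypothetical vec-based accelerator list.
struct Accelerators {
    /// ID of the first tracked value created since the last eviction.
    head: usize,
    /// Next ID to hand out; never reset.
    next: usize,
    /// One small map per tracked value with `id >= head`.
    list: Vec<HashMap<u64, bool>>,
}

impl Accelerators {
    fn new() -> Self {
        Self { head: 0, next: 0, list: Vec::new() }
    }

    /// Register a newly tracked value and return its ID.
    fn fresh_id(&mut self) -> usize {
        let id = self.next;
        self.next += 1;
        self.list.push(HashMap::new());
        id
    }

    /// Look up the accelerator for `id`, if it survived the last eviction.
    fn get(&mut self, id: usize) -> Option<&mut HashMap<u64, bool>> {
        if id < self.head {
            None
        } else {
            self.list.get_mut(id - self.head)
        }
    }

    /// Drop all accelerators; older IDs will now miss.
    fn evict(&mut self) {
        self.list.clear();
        self.head = self.next;
    }
}
```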

sturdy sequoia
#

but, do we ever keep Tracked beyond an evict call?

left night
#

I want comemo to have a fool-proof API

sturdy sequoia
#

I mean we could have a global Tracked counter

#

but it would require it to be Clone

#

(and panic on evict if any tracked values survive)

left night
#

I don't follow

left night
#

ID

sturdy sequoia
#

no, I mean a reference counter of all Tracked values

left night
#

ah you mean

#

like Arc

sturdy sequoia
#

yeah

#

and then panic on evict if there are Tracked values alive

left night
#

I don't like that either

#

evicting while tracked is sound in principle

sturdy sequoia
#

neither do I because it also means Tracked is Clone

sturdy sequoia
left night
#

you could try whether the accelerator sharding gives any gains

left night
#

meanwhile, I need to figure something out for measurement

sturdy sequoia
#

my gut says yes, but it's unclear

left night
#

I want to get the parallel stuff merged soon. the measurement thing is the only remaining blocker.

sturdy sequoia
left night
#

nah

#

I'm just dealing with the problems I created myself

#

it's crazy to me how invested (and obsessed) the contributor community overall is in the project. it makes me really happy.

sturdy sequoia
#

Especially regarding UX, and for that you should be 100000% proud of yourselves!

glad canyon
#

yeah, this is the most successful latex alternative so far

sturdy sequoia
#

@left night you have no idea how painful it was to modify Typst to use local accelerators 😭

#

(without using .clone() everywhere)

#

So, it's not that much faster than my initial implementation, it catches up to yours in cold compilation, and it's slightly faster across the board by about another 5% (up to 10% in some of my incremental tests)

#

One change I did make is to allow &Tracked<T> to be used to remove most of the cloning

#

So the ergonomics isn't terrible

#

but it's not perfect either

#

It blows my mind how freaking fast this makes Typst

#

with the parallel export we'll be golden 😮

#

(since with parallel comemo we can also get parallel export across the board)

#

like we are approaching sub 3s territory here for compiling a 160 page thesis filled with figures, introspection/queries, and other stuff

#

it's mind blowing

#

@left night are you seeing this?

#

my thesis just compiled in sub 3s

#

HOW

#

JUST HOOOOOOOOOWWWWWWWWWW

#

@onyx furnace with the work @left night and I are doing, oi-wiki and without wasmtime (which is also coming afaik): 44s cold, with wasmtime we'd likely be looking at ~20s cold

sturdy sequoia
# sturdy sequoia O-M-G

BTW, this is done using mimalloc, which is a highly optimized allocator for parallel applications

#

And there are still a ton of micro-optimizations we can do like using hashbrown, using faster hasher for all of the hashmaps, etc etc etc.

left night
left night
sturdy sequoia
#

Maybe it does and I just misremember

#

all you youngsters that are used to the modern rust std lib

#

back in my days mutexes were slow, channels too, and so were hashtables 😂

left night
#

I've been using Rust since 1.15 or sth like that

sturdy sequoia
#

I started using rust and I remember when the 1.0 was released!

left night
#

okay, I'm a youngster

sturdy sequoia
#

(I am old, please help 😭)

sturdy sequoia
#

So, after the most performance productive day in Typst history (99% being due to your work) I am off to bed 😴

sturdy sequoia
#

One last thing, compared to pre-content rework, that's a 5.6x to 6.2x improvement in incremental compilation time

left night
#

needed to calculate that before going to bed to sleep well? :D

sturdy sequoia
#

I have a big Excel spreadsheet (which is a complete mess) so it's fairly easy

sturdy sequoia
tight glade
#

oh wow that's crazy, that's so so impressive Oo

#

let me INVESTIGATE how the hell is this possible 😮

untold turret
#

@sturdy sequoia you are so excellent that the bottleneck is going to move to browsers or PDF viewers and will no longer be Typst compilation. I never imagined a general comemo in parallel

sturdy sequoia
#

especially since it's a rather highly optimized one too

tight glade
sturdy sequoia
#

mind you Laurenz suggested several comemo optimizations which are now in 🙂

tight glade
#

my god, you people are busy, that's amazing ❤️

sly pecan
#

How many threads does it use?

sturdy sequoia
#

@left night I pushed the changes for parallel comemo with local accelerators so that you can see what it ends up looking like

feral imp
#

It would be nice, if the cli had an option on how many threads it is "limited" to use.

#

Like -j or test-threads.

sturdy sequoia
#

it's also needed for parallel image encoding

sturdy sequoia
#

which currently isn't limited at all

sturdy sequoia
#

it's as fast as yours in cold compile, slightly faster incremental

left night
#

did you also try the sharded global approach?

sturdy sequoia
left night
#

you are using Mutex because it's always write, correct?

left night
#

but that's only because of the entry API

sturdy sequoia
#

so no need for the overhead of a second AtomicUx in there

left night
#

I wonder if RwLock with upgrade on miss would be faster

sturdy sequoia
left night
#

when entry is none

#

basically read -> check hash map

#

if success -> return

#

if not in the map -> upgrade read to write

sturdy sequoia
#

can we upgrade in a RwLock?

left night
#

incurs double hashing but only on the slow path
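
The read-then-upgrade idea above, sketched with `std::sync::RwLock` only. Note the assumption: `std` has no upgradable read lock (parking_lot offers one), so this version releases the read lock and re-acquires a write lock, then re-checks through the entry API in case another thread inserted in between. `get_or_insert_with` is a hypothetical helper name.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

fn get_or_insert_with(
    map: &RwLock<HashMap<u64, u64>>,
    key: u64,
    compute: impl FnOnce() -> u64,
) -> u64 {
    // Fast path: shared lock, one hash lookup, many readers in parallel.
    if let Some(&value) = map.read().unwrap().get(&key) {
        return value;
    }
    // Slow path: the read guard is gone by now. Take the exclusive lock and
    // hash the key a second time; `entry` covers the case where another
    // thread filled the slot between the two locks.
    let mut guard = map.write().unwrap();
    let value = *guard.entry(key).or_insert_with(compute);
    value
}
```

The double hashing only happens on a miss, which is the trade-off mentioned above.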

sturdy sequoia
#

I've also thought of just making the age atomic

#

then we don't even need mutability to lookup only when inserting

sturdy sequoia
#

which should help too
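
A small sketch of the atomic-age idea, with hypothetical types (`Entry`, `Cache`) that only loosely mirror what a memoization cache might store. Because the age is an `AtomicUsize`, a lookup can reset it through `&self`, so hits never need exclusive access; only inserting and evicting take `&mut self`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Entry<V> {
    value: V,
    /// Number of evictions since the entry was last used.
    age: AtomicUsize,
}

struct Cache<V> {
    entries: HashMap<u64, Entry<V>>,
}

impl<V> Cache<V> {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Lookup through a shared reference: a hit resets the age atomically.
    fn get(&self, key: u64) -> Option<&V> {
        let entry = self.entries.get(&key)?;
        entry.age.store(0, Ordering::Relaxed);
        Some(&entry.value)
    }

    fn insert(&mut self, key: u64, value: V) {
        self.entries.insert(key, Entry { value, age: AtomicUsize::new(0) });
    }

    /// Age every entry by one and drop those older than `max_age`.
    fn evict(&mut self, max_age: usize) {
        self.entries.retain(|_, entry| {
            let age = entry.age.get_mut();
            *age += 1;
            *age <= max_age
        });
    }
}
```

Behind an `RwLock`, this would let all hits share the read lock.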

left night
#

I'm not sure what the performance implications of an upgradable read are

left night
sturdy sequoia
left night
#

while the owned accelerator is in principle conceptually cleanest, it does make Tracked conceptually less nice, too. right now, it is just a wrapper around a reference and TrackedMut around &mut. And both can be used like those w.r.t. copying, reborrowing, etc. having it be Clone would ruin that.

sturdy sequoia
#

@left night atomic based age provided another huge speedup of about 10% across the board!!!!

#

some incremental compiles are now sub 100ms (LSP territory)

left night
#

I also think that the vec-based global accelerator might actually be faster because it needs no atomic cloning ops. maybe it could even reuse the allocations for the accelerators.

sturdy sequoia
#

I'll try that next 🙂

#

@left night using the sharded accelerators is another 10% faster cold, but equivalent incremental (perhaps 1-2% faster)

#

my bad, ignore that I had made a mistake I just caught

#

I'll check again once it's done building

#

Well, after fixing my mistake it is equivalent in cold & hot compiles, but it keeps Tracked Copy, which imo is worth it

left night
#

Agreed

#

Did you reuse the allocations or not?

sturdy sequoia
left night
#

yeah that

#

just clearing all of them instead of the vec

#

and resetting the head

sturdy sequoia
#

I'll see if it can be done

low sapphire
#

btw how much faster are the new changes compared to 0.10.0?

#

might have already been said, but it seems like you find new improvements every day and they add up :D

sturdy sequoia
#

and that's without some of the other optimizations that I am cooking on the side

left night
#

at this point, I think building the introspector is one of the main bottlenecks in incremental

#

this could possibly be fixed by integrating a hierarchical introspection API with layered caching directly into the frames

sturdy sequoia
#

but we'd need to verify the assumption that the introspector is the slowest bit

left night
#

we need the tracing 🙃

sturdy sequoia
#

(but it could be made to again)

left night
#

I don't mean the introspection itself

#

Just building the introspector

sturdy sequoia
#

Ok, so I have made typst significantly slower trying to re-use allocations

#

fun fact, it's faster without accelerators 😂

left night
#

interesting

#

that was definitely not the case when I first introduced it

#

maybe that's because all tracked methods have become cheaper

#

world is more optimized and introspector, too

#

but I still find it hard to believe that it's really slower

sturdy sequoia
#

it could also have to do with all of the goddamn locks everywhere on the accelerator 😂

sturdy sequoia
#

so:

  • one global accelerator: slower cold, slightly slower incremental
  • one accelerator in Arc per-Tracked: fastest cold, fastest incremental
  • sharded accelerators: slightly slower cold, barely slower incremental (but doesn't work right yet)
#

I have a feeling that the upgradable rwlock is super duper slow

left night
#

The arc version used Mutex right?

#

So this is Apples to Oranges

sturdy sequoia
#

because in the sharded accelerators I still need mutable access to the one accelerator

#

so I .read() over the list of accelerators then .write() on the individual accelerator

#

and I .write() the list of accelerators when calling evict but that doesn't matter too much imo
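
The locking pattern described above, sketched with hypothetical names (`ShardedAccelerator`, `add`, `record`, `lookup`): a shared `.read()` on the outer list, then a lock on just the one inner map, with the outer `.write()` reserved for adding a shard and for `evict`.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical sharded accelerator: one individually locked map per tracked value.
struct ShardedAccelerator {
    list: RwLock<Vec<RwLock<HashMap<u64, bool>>>>,
}

impl ShardedAccelerator {
    fn new() -> Self {
        Self { list: RwLock::new(Vec::new()) }
    }

    /// Add a shard for a newly tracked value; needs the outer write lock.
    fn add(&self) -> usize {
        let mut list = self.list.write().unwrap();
        list.push(RwLock::new(HashMap::new()));
        list.len() - 1
    }

    /// Outer read lock, inner write lock on a single shard.
    fn record(&self, index: usize, hash: u64, valid: bool) {
        let list = self.list.read().unwrap();
        if let Some(shard) = list.get(index) {
            shard.write().unwrap().insert(hash, valid);
        }
    }

    /// Outer read lock, inner read lock on a single shard.
    fn lookup(&self, index: usize, hash: u64) -> Option<bool> {
        let list = self.list.read().unwrap();
        let shard = list.get(index)?.read().unwrap();
        let result = shard.get(&hash).copied();
        result
    }

    /// Outer write lock: blocks everyone briefly, but happens rarely.
    fn evict(&self) {
        self.list.write().unwrap().clear();
    }
}
```

Even on the fast path this takes two locks per access, which may be part of why sharding did not pay off in the measurements above.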

left night
#

ah

#

what you'd need is a lockfree bump allocator

#

but even if that exists, you could probably not clear it through an immutable reference

sturdy sequoia
#

technically bumpalo is lockfree

#

so maybe that's the solution 🤔

#

it's not Sync 😭

quaint blaze
#

Average rust generic signature

#

So will multi-threaded typst come out next update?

#

I mean considering you've done all you can to do less work the next step is being able to do more work per second

low sapphire
sly pecan
#

Off to buy a 7995WX for typst

quaint blaze
#

GPU parallel typst rendering when

#

Imagine cetz running on the gpu

sly pecan
lunar kettle
#

Only image export

sturdy sequoia
lunar kettle
#

Which would be good for web app live preview

quaint blaze
#

At least I can use all 16 of my cores now

sturdy sequoia
quaint blaze
sturdy sequoia
quaint blaze
low sapphire
quaint blaze
#

I have the 4ghz I WANNA USE THE 4GHZ

lunar kettle
quaint blaze
#

WebGPU is neat though

#

A layer between vulkan, metal, opengl, etc. and your shaders

sturdy sequoia
left night
quaint blaze
sturdy sequoia
quaint blaze
feral imp
sturdy sequoia
#

I have made 15,63€ so far 😄

sturdy sequoia
#

@left night as far as I can tell, the best options are:

  • local hashmaps in an Arc
  • a single global hashmap
#

the sharding thing just... isn't working 😦

tight glade
sturdy sequoia
#

Ok, I have recovered the lost performance 🎉

#

@left night It's still slightly slower than a local hashmap, but it remains Copy

glossy shore
sturdy sequoia
sturdy sequoia
#

Ok, I managed to recover the last lil' bits of performance 😄

quaint blaze
#

Not sure if this would work but would it be performant to write a comemo cache to disk and when typst starts again load that cache

#

For persistent caching

sly pecan
#

Especially with all the recent optimizations

quaint blaze
#

Interesting

feral imp
#

I also think the issue is... Windows support? They were discussing some channel socket business, but windows was a drag.

sturdy sequoia
# quaint blaze Not sure if this would work but would it be performant to write a comemo cache t...

Possible? Yes, absolutely. Necessary? Imo (and even more so with all of the optimizations coming) not so much. You have to realize that we cache lots of things, which leads to ~3GB of use for my thesis (imo we probably cache too much), so reading and writing to disk would be excruciatingly slow. And finally, we frankly don't need it: unless your document is really really big it will take like ~20s to compile with these improvements (see oi-wiki, which is massively down with the improvements), so the further gains would be to maybe make that < 10s but with all of the complexities of disk caching

#

Additionally, there is the problem of deciding where to store them 🤔

left night
#

It's also not that simple technically

#

For instance, we're hashing a lot of pointers to statics, which can be affected by ASLR (Address Space Layout Randomization)
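
To illustrate the ASLR point, here is a small sketch with hypothetical helper names. Hashing the address of a static is cheap and consistent inside one process, but the address can change between runs, so such a hash would be useless as a key in an on-disk cache. Hashing the content does not have that problem.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

static NAME: &str = "heading";

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hashes *where* the string lives: only meaningful within one process,
/// since the address may differ on the next run.
fn hash_by_address(s: &'static str) -> u64 {
    hash_of(&(s.as_ptr() as usize))
}

/// Hashes *what* the string contains: independent of memory layout.
fn hash_by_content(s: &str) -> u64 {
    hash_of(&s)
}
```

A persistent cache would need content-based keys everywhere a pointer is hashed today, which is one of the costs being weighed here.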

sturdy sequoia
left night
#

it's a bunch of problems and I don't think it's worth it

sturdy sequoia
quaint blaze
sturdy sequoia
quaint blaze
#

Yeah

feral imp
#

Make it work, make it fast, make it work on hardware from the last decade, make it work on IE7, make it rain ☔

quaint blaze
#

If typst can do that I think it's succeeded in being speed

#

If I can get it to run on my 15 yr old Lenovo laptop locally without performance issues I'll count that as a win

sly pecan
feral imp
sly pecan
quaint blaze
#

but yeah 8GB is reasonable

#

If we can get 4GB running well that's the best outcome imo

sly pecan
#

This is from the Steam hardware survey. Obviously biased towards higher ram, but it's a data point

quaint blaze
#

I'd say that 8gb is probably the most practical minimum

left night
#

It's anyway desirable for Typst to run well with 4GB max to itself (which is already a lot of course), since that's the WebAssembly limit

lunar kettle
#

at least until memory64 becomes more prevalent 😄

#

but yeah I guess 4GB should be a sensible limit

tight glade
#

Honestly, I would be surprised if Typst wasn't already running pretty well on old hardware

sly pecan
untold turret
#

I feel that the best solution is to adopt a circuit breaker for memoization, because as long as we don't reach the memory limit, the more memory we use, the faster compilation gets. When it detects that a big document takes too much memory, it disables some memoization that has relatively low efficiency, to keep the experience acceptable.

sly pecan
sly pecan
sturdy sequoia
sturdy sequoia
sly pecan
#

#naive

sturdy sequoia
left night
#

It was intentional because I wasn't sure whether setting 4GB max would fail in some browser, so I wanted to test with a lower limit first

#

Because the docs aren't clear about whether the full amount is always allocated up front when using shared memory

untold turret
#

The cache evict strategy tends to behave like a smart (@sly pecan, e.g. dynamically prioritized) generational GC, so I'm not sure whether we could also steal some techniques from them and still keep simplicity.

sturdy sequoia
untold turret
quaint blaze
#

Me when the thing to compile my code into pdf contains a generational GC for performance reasons

#

Typst is absurd in some ways and I love it

onyx furnace
#

just checked the new parallel comemo pr and i dont quite understand what accelerator does. i guess i've missed a lot of discussion. 😂 can anyone tell me about that or improve the docs?

sly pecan
#

I just wanna bring up this message by @sturdy sequoia #contributors message Running plugins in parallel would be massive for cold compiles on documents that use them extensively

#

(Say a plugin that adds support for an image format for instance)

sturdy sequoia
#

validation is the act of checking whether two runs have returned the same values, which Typst uses to check whether the doc has converged

sly pecan
sturdy sequoia
#

They were already there

onyx furnace
#

oh, i thought it was new

sturdy sequoia
#

so clearly they're needed 😂

left night
sly pecan
#

I guess testing would be required, but if I had 100 calls to a plugin, each taking 0.1s, I'm guessing it might be worth it to run them in parallel

#

even with overhead

glossy shore
#

Perhaps plugins could somehow hint to their parallelism support

#

Plugins that really don't need it will just not ask for it, and those that do will have to ask for it and guarantee themselves that they won't break

sturdy sequoia
#

parallel: false by default

glossy shore
#

I think it could be more refined than that but that's definitely also an option

#

Also is there any way for plugins to directly generate PDF elements or Typst values?

glad urchin
#

no

#

they arent aware of typst at all

glossy shore
#

I figured

glad urchin
#

other than the minimal api

sturdy sequoia
glossy shore
#

Shame

#

could have real potential

lunar kettle
#

and could become a real mess 😂

glossy shore
#

fair

sturdy sequoia
lunar kettle
#

i think the way we have it now is a good balance, at least for now

glossy shore
#

for now yeah, it's a good MVP

lunar kettle
#

adding typst values somehow would be more reasonable, but writing pdf directly would be pretty hard I think

sturdy sequoia
#

where they can pass serialized frame and load them into Typst

lunar kettle
tight glade
left night
#

@sturdy sequoia I've reviewed the comemo PR

sturdy sequoia
sturdy sequoia
#

@left night regarding the last_was_hit feature in comemo, I made it default because tests can't specify their own feature 😦

feral imp
#

(and frankly, I'd do that and make a test-all feature that can be used in CI / local testing that includes this feature, and then remove it from default, just for the cleanness)

feral imp
# glad urchin but then you dont test...?

I mean cargo test --all-features is definitely there for a reason, especially if your features are all additive, if not, then you'd make a feature say test-all that includes all additive features, and then it's cargo test --features test-all............

glad urchin
#

well, if that works then sure

left night
#

it's a bit of a pain because you still need to pass them manually

#

but that's life

sturdy sequoia
#

@left night I am done with the code review items, I am just testing whether the immutable: HashMap<...> is really necessary 🙂

#

@left night (sorry for double ping) what would you replace it by, the old method?

#

or just always adding immutable calls? 🤔

#

I can also test both

left night
#

Probably the deduplication is still worth it

#

But hard to say

#

Since I don't want to overoptimize it for the Tracer usage, I would probably keep it

sturdy sequoia
#

@left night As far as I can tell, on my thesis at least, the two methods are equivalent: without the hashmap it is barely faster (like 10ms on an incremental run of half a second) but it is slower in cold by 0.1s

#

So they're pretty much equivalent

#

And I just tested without dedup

#

and it's a good 30% slower than either method 😂

#

I'll keep the "simpler" manual dedup method then 🙂

#

@left night it's pushed

left night
#

@sturdy sequoia does the non Send + Sync version really need a separate surface?

#

since the Send + Sync version forces the type to be Send + Sync anyway...

sturdy sequoia
#

@left night I did it because the TrackedMut has the #ty passed to it:

#

So I wasn't really sure whether I could merge them

left night
#

I see. Hmm, the duplication is unfortunate.

sturdy sequoia
left night
#

Maybe a better alternative would be to either just require Send + Sync or allow the annotation to be #[comemo::track(unsync)]. in either case, I think we want to generate just one.

#

since we don't personally have a use case for unsync, we can also just skip it I guess

#

the return value must also be sync, so it's a bit moot

sturdy sequoia
#

it's a bit annoying imo to have to write Tracked<dyn Trait + Send + Sync>

#

What I don't understand is that if trait Trait: Send + Sync { ... }, why do we need to specify it again

#

?

left night
#

did you try that?

sturdy sequoia
#

I think so

#

but I can try again

left night
#

maybe we don't, from a quick test

#

if we can just annotate the World trait Send + Sync, that'd certainly be ideal and no changes in comemo would be required (compared to main)
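
A quick way to check this, using a stand-in trait rather than Typst's real `World` (the `answer` method and `SystemWorld` body here are hypothetical). With `Send + Sync` as supertraits, the trait object `dyn World` is itself `Send + Sync`, so no extra `+ Send + Sync` is needed at the use sites.

```rust
/// Stand-in for the real trait: the supertraits do the work.
trait World: Send + Sync {
    fn answer(&self) -> u32;
}

struct SystemWorld;

impl World for SystemWorld {
    fn answer(&self) -> u32 {
        42
    }
}

/// Compiles only if `T` is both `Send` and `Sync`.
fn assert_send_sync<T: ?Sized + Send + Sync>() {}

fn check() {
    // No `dyn World + Send + Sync` needed here.
    assert_send_sync::<dyn World>();
    assert_send_sync::<&dyn World>();
}

/// Borrow the world from another thread through the plain trait object.
fn share(world: &dyn World) -> u32 {
    std::thread::scope(|s| s.spawn(|| world.answer()).join().unwrap())
}
```

The flip side is that every implementor must really be `Send + Sync`, which is why interior mutability such as `RefCell` has to go.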

sturdy sequoia
#

from my quick test it seems to be the case

left night
#

very nice!

sturdy sequoia
#

Lemme double check before I make a fool of myself 😄

left night
#

Tracked<'_, dyn World + Send + Sync> was indeed ugly

sturdy sequoia
#

well it's not needed somehow

#

I guess through some of the other cleanups

#

@left night there are still some changes needed to World regarding interior mutability for World to be truly Send + Sync

#

(removing OnceCell, RefCell, etc.)

left night
#

Those are on my branch

sturdy sequoia
#

And I confirm, typst works just fine by just swapping comemo and modifying the interior mutability (as well as some stuff that returns Self which isn't supported anymore 😠)

left night
# sturdy sequoia It's pushed 🙂

I did some refactoring on the PR. In particular, I replicated the Inner optimization from the mutable constraints to the immutable constraints for less locking, split the thing up into a few more files, and switched to required-features for tests. Feel free to take a look / test it yourself before I merge.

sturdy sequoia
sturdy sequoia
#

@left night as a final tally, this provides a 30-35% performance improvement in incremental 😉

untold turret
#

🥺 @sturdy sequoia idea?

sturdy sequoia
# untold turret 🥺 @sturdy sequoia idea?

I think the easiest in this way will be to have your js handling on a separate thread and use channels to communicate from the world to the js. That will keep world send and sync while not requiring js values to be send and sync
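
A minimal sketch of that suggestion, with everything hypothetical: `JsLike` stands in for a non-`Send` JS value (an `Rc` is used to make it non-`Send`), and `FileService` is the handle a world could hold. The non-`Send` value never leaves its own thread, and the world talks to it over channels.

```rust
use std::rc::Rc;
use std::sync::mpsc::{channel, Sender};
use std::sync::Mutex;
use std::thread;

/// Stand-in for a JS value: the `Rc` makes it neither `Send` nor `Sync`.
struct JsLike {
    files: Rc<Vec<(String, String)>>,
}

/// A request sent to the JS thread, with a channel for the reply.
struct Request {
    path: String,
    reply: Sender<Option<String>>,
}

/// The handle stored in the world: only a sender, so it is `Send + Sync`.
pub struct FileService {
    tx: Mutex<Sender<Request>>,
}

impl FileService {
    pub fn spawn() -> Self {
        let (tx, rx) = channel::<Request>();
        thread::spawn(move || {
            // The non-Send value is created and used on this thread only.
            let js = JsLike {
                files: Rc::new(vec![("main.typ".into(), "Hello".into())]),
            };
            for req in rx {
                let found = js
                    .files
                    .iter()
                    .find(|(path, _)| *path == req.path)
                    .map(|(_, content)| content.clone());
                // Ignore the error if the requester has gone away.
                let _ = req.reply.send(found);
            }
        });
        Self { tx: Mutex::new(tx) }
    }

    /// Blocking round trip to the JS thread.
    pub fn read(&self, path: &str) -> Option<String> {
        let (reply, answer) = channel();
        let request = Request { path: path.to_owned(), reply };
        self.tx.lock().unwrap().send(request).ok()?;
        answer.recv().ok()?
    }
}
```

Each call pays for a channel round trip, so this trades some latency for not needing `unsafe impl Send`.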

untold turret
#

That may add a lot of communication cost, but I think that's a general way to allow non-sync things to be sync

#

👿 still a pain to be honest, I may use unsafe impl Send before I have time to make that sync wrapper...

#

@sturdy sequoia Another question, will multi-threaded comemo become slower if its user doesn't use any parallel thing like rayon? Some of my private projects (not related to typst) use comemo, but they all run in a single thread as intended.

sturdy sequoia
sturdy sequoia
#

Maybe you could just use a patch in Cargo.toml to force use the old version of comemo (single threaded)?

sturdy sequoia
cunning wadi
sturdy sequoia
#

OH YES

#

Time for a release? or do you want to test it some more?

left night
#

not sure. some more testing might be sensible.

untold turret
sly pecan
untold turret
left night
#

since we control comemo, I guess we can switch to a git dependency temporarily

#

I just don't want to use git deps for unreleased things that we don't control because it could mess with our releases

sturdy sequoia
sturdy sequoia
sturdy sequoia
#

I can open a PR from my branch if you want

left night
#

but CI passed?

sturdy sequoia
#

weird

#

'cause it shouldn't

left night
#

which is already on main

sturdy sequoia
#

because the World impl in the CLI and in the tests use OnceCell from the stdlib

left night
#

the bench world I didn't touch

sturdy sequoia
#

ah right, I didn't know that

left night
#

feel free to make a PR instead of me merging my branch

sturdy sequoia
#

Then it looks good!

left night
#

it's your work after all

sturdy sequoia
left night
#

please use a git dep with rev instead of a patch though

#

otherwise typst can't be used as a library properly

left night
sturdy sequoia
#

BTW, as a quick update, compared to pre-content rework, we are at around 5x faster incremental and 2.5x faster cold

#

INSANE

#

Without parallel typst which brings that even higher

left night
#

pretty nice, eh?

#

good work

sturdy sequoia
#

I think there is still some perf on the table while remaining single threaded, but we've already done so much!

#

So hum... I have two hours to make slides 💀

left night
#

oh no, godspeed

keen scroll
sturdy sequoia
low sapphire
#

you lied TanyaAngry
I tried the recent changes and it's a major improvement again xD

sturdy sequoia
feral imp
#

Exposed.

sturdy sequoia
low sapphire
low sapphire
#

yeah

sturdy sequoia
#

From which version????

low sapphire
#

I compiled the latest main via cargo sometime yesterday

#

from 0.10.0

sturdy sequoia
#

Okay, what cpu are you using?

#

Because the gains should be 30% incremental but pretty much zero cold

low sapphire
#

should be Intel Core i5-8265U

sturdy sequoia
#

Goddamn it confirms my theory that lower end cpus benefit more

#

😎

sturdy sequoia
low sapphire
#

wait i'll measure it again

#
Benchmark 1: typst c document.typ
  Time (mean ± σ):      1.806 s ±  0.018 s    [User: 1.540 s, System: 0.623 s]
  Range (min … max):    1.787 s …  1.849 s    10 runs
 
Benchmark 2: typst-dev c document.typ
  Time (mean ± σ):      1.028 s ±  0.024 s    [User: 1.247 s, System: 0.086 s]
  Range (min … max):    1.012 s …  1.095 s    10 runs
 
Summary
  typst-dev c document.typ ran
    1.76 ± 0.05 times faster than typst c document.typ

"typst-dev" is 41c0dae2, "typst" is just 0.10.0

#

that was a different document though (35 pages). I make a lot of use of states and some calculations

#

Simple edits for incremental:

  • typst: 175ms - 240ms
  • typst-dev: 150ms - 200ms
#

idk how I would easily pipe this into hyperfine though^^ just an estimate, so cold compile time definitely improved a lot more

low sapphire
#

@sturdy sequoia I haven't tried it yet, but do any of the recent developments also make wasm faster? I remember it being very slow, especially for cold compilations when it first came out

sturdy sequoia
lunar kettle
#

I mean using it in the compiler is easy but integrating it into the web app is hard

sturdy sequoia
left night
sturdy sequoia
#

Should be fairly easy

glad urchin
left night
#

That would still need world integration

#

Which I was hoping to avoid

#

It's also a bit unfortunate to unconditionally spawn evermore workers just in case although that will need to happen for rayon to work anyway

#

In short: I originally thought I could quickly use the browser wasm and just ship it in a day, but since it's more complex, I shifted it back in my priorities.

sly pecan
sturdy sequoia
#

Which is why it's so sloooooooow

sly pecan
keen scroll
feral imp
#

Node.js != the browser. That's what I've learned, is this true?

sturdy sequoia
onyx furnace
#

i found this. maybe it can help typst.ts and (webapp?)

#

A wrapper type that allows you to send !Send types to different threads.

sturdy sequoia
#

@left night Can you explain to me why you don't want the plugin interface to be part of World? Just curious as to what's the rationale here, it could be just a Sender<PluginCall> or something like that

left night
sturdy sequoia
low sapphire
#

Nice!

left night
sturdy sequoia
#

that's it really

low sapphire
left night
#

you probably mean parking_lot vs std

sturdy sequoia
#

or am I confused? thonk

#

Ah no, the big difference is that OnceCell has the .wait() method

#

whoopsy

left night
sturdy sequoia
#

I confuss meself

left night
#

Regarding Hash and PartialEq

#

Does it really happen only later for the pattern or immediately?

sturdy sequoia
#

in my initial work that's what I had done

left night
left night
#

waiting directly seems cleaner
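
For readers unfamiliar with the pattern under discussion, here is a minimal std-only sketch of a deferred value whose consumer simply blocks until a background thread has produced it. The `Deferred` type and its methods are hypothetical names for illustration, and the sketch uses a `Mutex` plus `Condvar` instead of a `OnceCell` with `.wait()`, so it should not be read as the actual Typst code.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical deferred value: computed on a background thread,
/// while `wait` blocks the caller until the result is ready.
struct Deferred<T> {
    slot: Arc<(Mutex<Option<T>>, Condvar)>,
}

impl<T: Send + 'static> Deferred<T> {
    /// Starts computing `f` on a new thread right away.
    fn new<F>(f: F) -> Self
    where
        F: FnOnce() -> T + Send + 'static,
    {
        let slot = Arc::new((Mutex::new(None), Condvar::new()));
        let producer = Arc::clone(&slot);
        thread::spawn(move || {
            let value = f();
            let (lock, ready) = &*producer;
            *lock.lock().unwrap() = Some(value);
            // Wake up every thread currently blocked in `wait`.
            ready.notify_all();
        });
        Deferred { slot }
    }

    /// Blocks until the value is available, then returns a clone of it.
    fn wait(&self) -> T
    where
        T: Clone,
    {
        let (lock, ready) = &*self.slot;
        let mut guard = lock.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while guard.is_none() {
            guard = ready.wait(guard).unwrap();
        }
        (*guard).clone().expect("value was just checked to be present")
    }
}
```

Waiting directly at the point of use keeps the caller free of any polling or "is it done yet" bookkeeping, at the cost of blocking that thread.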

sturdy sequoia
#

@left night do we have any CI for the CLI as a whole? Like testing the CLI itself not just the "engine" so-to-speak

left night
#

No

feral imp
left night
#

@sturdy sequoia doesn't have to be part of this PR, but parallel cmap creation and font subsetting is probably also relatively low-hanging fruit.

sturdy sequoia
#

Where should I look? 🤔

#

create_cmap I suppose?

left night
#

Just the subsetting would be simplest, but the cmap creation mutably borrows the glyph_set that's used by the subsetting, so doing cmap -> subsetting sequentially per font, but for all fonts at the same time, seems reasonable

#

I don't think the glyph_set is used after that loop though, so it could probably be taken

sturdy sequoia
#

@left night I doubt I did it how you wanted so here's what I did:

  • First: I prepare the items by allocating all of the refs etc. on a single thread
  • Second: I use a par bridge to go multithreaded where I do all of the font stuff
  • Third: I finalize on a single thread to write to the PDF
#

does that make sense or is it wayyyy too overengineered?

left night
sturdy sequoia
left night
#

Why do you need to allocate the refs in the first part?

sturdy sequoia
#
let processed_fonts = ctx.font_map
    .items()
    .map(|font| {
        prepare_font(&mut ctx.alloc, &mut ctx.font_refs, &mut ctx.glyph_sets, font)
    })
    .par_bridge()
    .map(process_font)
    .collect::<Vec<_>>();

for processed in processed_fonts {
    finalize(&mut ctx.pdf, processed);
}
#

This is essentially what happens
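
The same prepare / process / finalize shape can be sketched without rayon. This is only an illustration under stated assumptions: `Prepared`, `Processed`, and `export` are hypothetical stand-ins for the real export types, "subsetting" is faked as deduplicating glyph ids, and one scoped std thread is spawned per font instead of using a par bridge.

```rust
use std::thread;

// Hypothetical stand-ins for the real export types.
struct Prepared {
    id: usize,
    glyphs: Vec<u16>,
}

struct Processed {
    id: usize,
    subset_len: usize,
}

/// Phase 1 (single thread): hand out ids in a deterministic order.
fn prepare(fonts: &[Vec<u16>]) -> Vec<Prepared> {
    fonts
        .iter()
        .enumerate()
        .map(|(id, glyphs)| Prepared { id, glyphs: glyphs.clone() })
        .collect()
}

/// Phase 2 (parallel): per-font work that touches no shared mutable state.
fn process(p: &Prepared) -> Processed {
    let mut glyphs = p.glyphs.clone();
    glyphs.sort_unstable();
    glyphs.dedup();
    Processed { id: p.id, subset_len: glyphs.len() }
}

/// Runs all three phases; phase 3 (single thread) writes results in id order.
fn export(fonts: &[Vec<u16>]) -> Vec<String> {
    let prepared = prepare(fonts);
    let mut processed: Vec<Processed> = thread::scope(|s| {
        let handles: Vec<_> = prepared
            .iter()
            .map(|p| s.spawn(move || process(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Joining in spawn order already keeps the order; sorting makes it explicit.
    processed.sort_by_key(|p| p.id);
    processed
        .iter()
        .map(|p| format!("font {}: {} glyphs", p.id, p.subset_len))
        .collect()
}
```

Allocating ids up front is what keeps the output deterministic even though the middle phase finishes in arbitrary order.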

left night
#

I had thought to just prepare the cmap and subset font and then merge First + Third

left night
#

maybe we should hold off for now

#

and instead design a proper architecture for the whole export

#

where we (likely) have indeed the three phases you described

#

but for everything

sturdy sequoia
#

Okay, so I don't push the font changes

#

On my thesis, I cannot measure the change either way 💀

#

But I guess this would be more impactful on CJK stuff

left night
#

that's curious, I would have thought that subsetting is quite expensive

#

but for CFF likely indeed much more than for TrueType

sturdy sequoia
#

I mean overall PDF export isn't that slow

#

image encoding was extremely slow

#

but since it's been multithreaded it's very decent

left night
#

in the naive parallel test I did, it was quite slow

#

but probably just because the test was naive and everything else was very fast

#

30000 page document or sth

sturdy sequoia
#

I mean right now it's only parallel across fonts, right?

#

not within a single font

#

so if a single font is slow it won't change much

left night
#

yeah

low sapphire
sturdy sequoia
low sapphire
sturdy sequoia
low sapphire
#

I tried the same from last time

#

seems to be slightly slower on my hardware, but only by a few milliseconds

#

but if it speeds it up for other docs, I guess that's fine

sturdy sequoia
#

How long is your doc and how many cores do you have?

low sapphire
#

35 pages, 4 cores

#

I just ran a test with hyperfine --warmup 3 -m 100 ...

#

your PR is ~10ms slower on my laptop

#

for that payload

low sapphire
#

it's slightly faster on the same template I used, but without any "content"

#

which has 5 pages

sly pecan
low sapphire
#

I only checked out his branch and compiled it

sly pecan
#

the documents I've tested it on might have other bottlenecks

#

that being said, I only tried pdf

low sapphire
#

good point

#

lemme try svg

low sapphire
#

1.011 s vs 989.5 ms

#

might be even better on newer CPUs, idk

sly pecan
#

@sturdy sequoia by the way I'm seeing a performance increase with target-cpu=x86-64-v3

#

it's small, but there

#

like <1%

#

granted my testing was very haphazard

sturdy sequoia
#

compared to what? 🤔

sly pecan
#

x86-64-v3 does however exclude CPUs older than approximately 10 years

sturdy sequoia
sly pecan
sturdy sequoia
#

figures

sly pecan
#

Anyway, it's free real estate

sturdy sequoia
#

which I'm sure Laurenz won't like

#

(not that I care, if your CPU doesn't have AVX2 just throw it in the trash)

sly pecan
#

(although I can't imagine that many people are using pre-haswell computers today)

sturdy sequoia
low sapphire
#

Penryn is the code name of a processor from Intel that is sold in varying configurations as Core 2 Solo, Core 2 Duo, Core 2 Quad, Pentium and Celeron.
During development, Penryn was the Intel code name for the 2007/2008 "Tick" of Intel's Tick-Tock cycle which shrunk Merom to 45 nanometers as CPUID model 23. The term "Penryn" is sometimes used to...

rigid latch
#

🙂 Naive question. What are you optimizing? Typst is freakingly fast for me.

rigid latch
#

Is there anything that is slow in particular?

sturdy sequoia
#

I should also mention that you must think of it this way: every optimisation has a big impact on people on low end devices since they're constrained by the performance of their hardware

feral imp
#

I think before dherse started, there were a few issues... Thesis-length documents were quite slow to compile.

And it was too slow frankly.

Plus I tried to start a paper with a lot of spurious png files in it, and performance suffered quite fast.

rigid latch
#

Indeed, I had performance issues before 0.8.0 and 0.9.0 release. But now most of it is gone. I wonder what is left to do. If png files are big, I wouldn't expect the compilation / rendering be fast anyway.

sturdy sequoia
feral imp
sturdy sequoia
#

I made comemo parallel, but he's making the engine parallel

violet axle
lament fulcrum
#

that seems like another good benchmarking candidate if you can share it or remove sensitive information

#

and maybe there's even some overuse of state and locate one can get rid of

lunar kettle
violet axle
#

Yes, but it is much faster on my newer system. It was more of a relative comparison to the latex times.

#

For sharing I would need either a more private setting or some cleanup time since we have solutions included. (which get integrated depending on a boolean flag)

sly pecan
violet axle
#

No, there are also quite a few tables and graphics (some of which we converted from pdf to svg). The document is only ~100 pages long.

sly pecan
#

18 seconds sounds too slow for that, unless this is on a potato

#

Anyway, latex can be quite fast on "simple" documents (i.e. no fancy packages etc). Probably faster than typst cold compile can hope to be (at least single threaded)

left night
#

because of more primitive text shaping in pdfLaTeX?

sly pecan
sly pecan
#

And typst presumably introduces extra overhead in order to be able to incrementally compile

left night
#

Yes, that's true, although I don't think it's that much

sly pecan
#

I haven't tried with all the performance improvements, but pdftex was definitely faster on just pure text

#

Possibly even luatex, but I can't recall

#

Should try again some time

left night
#

The text layout itself hasn't really been optimized yet in Typst

sturdy sequoia
hoary dew
untold turret
#

@sturdy sequoia when I upgrade comemo, I find this function is completely broken. 🫠 Now I want to have a thread local comemo and a threaded comemo in same program.

sturdy sequoia
#

Why does it break BTW?

untold turret
#

the return type is not send

#

*It was mentioned in this forum before. But I didn't replace comemo and got errors on the comemo::memoize macros.

#

btw, is comemo::evict weird if there are multiple threads that run different incremental tasks?
In typst preview, there is an incremental compile thread, which evicts the comemo cache after each compilation, and there is an incremental export thread, which evicts the comemo cache after each render task.

sturdy sequoia
sturdy sequoia
#

That's indeed an issue 😭

untold turret
sturdy sequoia
sturdy sequoia
#

all of the Tracked<T> will still work, they'll just be slower

untold turret
#

you are right, comemo is safe to evict at any time, but that looks a bit weird.

sturdy sequoia
#

Essentially, since functions only memoize once they reach the output, evicting doesn't affect them.

#

and for accelerators in tracked structs, it just disables accelerations for outdated tracked

#

this is because it is a fairly dumb generational GC with a max generation of 1 for accelerators
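
To make the "generational" idea concrete, here is a toy model of age-based eviction. It is a hypothetical sketch, not comemo's internals: the `Cache` type, its fields, and the exact aging rule are assumptions chosen to show why evicting at any time is safe (a dropped entry only costs a recomputation).

```rust
use std::collections::HashMap;

/// Toy memoization cache; a hypothetical model, not comemo's internals.
struct Cache {
    /// key -> (memoized value, age in eviction rounds since last use)
    entries: HashMap<u64, (u64, usize)>,
}

impl Cache {
    fn new() -> Self {
        Cache { entries: HashMap::new() }
    }

    /// Returns the cached value (resetting its age) or computes and stores it.
    /// The bool reports whether this call was a cache hit.
    fn get_or_compute(&mut self, key: u64, f: impl FnOnce(u64) -> u64) -> (u64, bool) {
        if let Some(entry) = self.entries.get_mut(&key) {
            entry.1 = 0;
            return (entry.0, true);
        }
        let value = f(key);
        self.entries.insert(key, (value, 0));
        (value, false)
    }

    /// Drops entries whose age has reached `max_age`, then ages the rest by one.
    fn evict(&mut self, max_age: usize) {
        self.entries.retain(|_, entry| {
            let keep = entry.1 < max_age;
            entry.1 += 1;
            keep
        });
    }
}
```

In this model, an entry that is used between two eviction rounds survives, while one that sits unused for `max_age` rounds is dropped and simply recomputed on its next use.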

untold turret
#

So how can I go on 💀.

sturdy sequoia
#

I wonder if we could have a #[comemo::memoize(local)] too, it shouldn't be too hard

#

I'll see if I can do that 😉

untold turret
#

So we will have an evict and an evict_local?

sturdy sequoia
untold turret
#

@sturdy sequoia If we borrow design ideas from previous work to make threaded comemo really match Rust's thread model, I think tokio's runtime handler design is worth checking. They store lightweight metadata in a thread local. Then a thread function can easily create an isolated async handler, enter that handler, or exit that handler. The handler is also sendable among threads, so they can group threads into different async isolates.

#

Similarly, comemo could have a Handler that you create. And we group threads and caches by handlers.

sturdy sequoia
#

@untold turret

sturdy sequoia
glossy shore
untold turret
#

๐Ÿฑ having local is great enough. In preview scenario, A single compiler uses evict and rest render tasks uses evict_local.

glossy shore
#

Can't you just &'static mut ()

sturdy sequoia
#

The only problem @untold turret is that you cannot pass Tracked into locally memoized functions

untold turret
#

But lsp may have multiple compilers.

sturdy sequoia
#

so they won't ever interfere

sturdy sequoia
sturdy sequoia
untold turret
#

Not a problem till now.

glossy shore
#

Oh right mut pointers are send if their payload is sync, which () is probably

#

let me check that

untold turret
sturdy sequoia
#

mut references can be send, mut pointers never are

glossy shore
#

yup seems so

#

I guess because mut pointers are Copy

#

Sad that we can't just

struct NonSend;
impl !Send for NonSend {}

in stable yet.

glad urchin
#

yeah you need to use phantom data with pointer or w/e
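
The phantom-data trick mentioned here works on stable Rust because raw pointers are neither `Send` nor `Sync`, and `PhantomData<T>` inherits those auto traits from `T`. A minimal sketch follows; `NonSend` and `LocalHandle` are illustrative names, not types from comemo.

```rust
use std::marker::PhantomData;

/// Zero-sized marker that is neither `Send` nor `Sync` on stable Rust,
/// because `*const ()` is neither and `PhantomData` inherits that.
#[derive(Default)]
struct NonSend(PhantomData<*const ()>);

/// Hypothetical handle that must stay on the thread that created it.
#[derive(Default)]
struct LocalHandle {
    id: u32,
    _not_send: NonSend,
}

/// Compile-time check: only callable for `Send` types.
fn assert_send<T: Send>() {}

fn checks() {
    assert_send::<u32>();
    // The next line would fail to compile, which is the whole point:
    // assert_send::<LocalHandle>();
}
```

The marker costs nothing at runtime since it is zero-sized; it only changes which auto traits the containing struct gets.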

glossy shore
#

I guess you could &() as &dyn Any

#

Or you'd probably need like Eq or something right?

sturdy sequoia
#

@untold turret I did find a way of passing Tracked inside of a local memoized function!

glossy shore
#

Well it might look something like this

#[test]
fn test_unsend() {
    trait OpaqueNonSend: PartialEq<()> + Debug {}
    impl OpaqueNonSend for () {}
    #[comemo::memoize(local)]
    fn add(_a: u32, _b: u32) -> &'static dyn OpaqueNonSend {
        &()
    }

    test!(miss: add(1, 2), &());
    test!(hit: add(1, 2), &());
    test!(miss: add(2, 3), &());
    test!(hit: add(2, 3), &());
}

I don't know if this is really better with all the trait mess

sturdy sequoia
glossy shore
#

Yeah I know

#

Again I'm not so sure if this makes things any easier to think about

sturdy sequoia
# untold turret how do you do that?
    #[comemo::memoize(local)]
    fn dump<'local>(mut sink: TrackedMut<'local, Emitter>) {
        sink.emit("a");
        sink.emit("b");
        let c = sink.len_or_ten().to_string();
        sink.emit(&c);
    }
#

You must manually annotate the lifetimes specifically with the name 'local!

#

Internally it replaces the 'local with 'static

#

it's a bit ugly but it works

glossy shore
#

len_or_ten?

sturdy sequoia
glossy shore
#

oh ok

sturdy sequoia
untold turret
sturdy sequoia
#

internally it does the following:

#
type __ARGS<'local>  =  <::comemo::internal::Args<(TrackedMut<'local,Emitter> ,)> as ::comemo::internal::Input>::Constraint;
#

It's a limitation of how thread_local in std resolves lifetimes, it's weird because it should work but it doesn't...

#

Probably running into a weird edge case in rustc

untold turret
#

thread local is undoubtedly an edge case. We have... many crates to ease our coding lives.

sturdy sequoia
#

the thing is that in the non-local version, we don't need to annotate lifetime and it still resolves just fine

#

but in the local case, where it goes through the thread_local macro we do

#

so it's really dumb

sturdy sequoia
#

@glad urchin in tablex, there is this function:

// Measure a length in pt by drawing a line and using the measure() function.
// This function will work for negative lengths as well.
//
// Note that for ratios, the measurement will be 0pt due to limitations of
// the "draw and measure" technique (wrapping the line in a box still returns 0pt;
// not sure if there is any viable way to measure a ratio). This also affects
// relative lengths: this function will only be able to measure the length component.
//
// styles: from style()
#let measure-pt(len, styles) = {
    let measured-pt = measure(box(width: len), styles).width

    // If the measured length is positive, `len` must have overall been positive.
    // There's nothing else to be done, so return the measured length.
    if measured-pt > 0pt {
        return measured-pt
    }

    // If we've reached this point, the previously measured length must have been `0pt`
    // (drawing a line with a negative length will draw nothing, so measuring it will return `0pt`).
    // Hence, `len` must either be `0pt` or negative.
    // We multiply `len` by -1 to get a positive length, draw a line and measure it, then negate
    // the measured length. This nicely handles the `0pt` case as well.
    measured-pt = -measure(box(width: -len), styles).width
    return measured-pt
}
#

Could that not just be using the calc.sign and calc.abs to measure only once? I'm asking because the measure calls really add up

#

Maybe that would be better handled if we had a len.resolve(styles) method, that way you could obtain an absolute size from a relative one (i.e ems)

#

@left night Do you think we could have this: length.resolve(styles) it would also take the styles argument (just like measure) but work directly with a length as a much cheaper alternative to relying on measure here:

    /// Resolve this length to an absolute length.
    #[func]
    pub fn resolve(
        &self,
        /// The styles with which to measure the length.
        styles: Styles,
    ) -> Length {
        let styles = StyleChain::new(&styles);
        Length { abs: self.abs + self.em.resolve(styles), em: Em::zero() }
    }
#

It's a dead simple function too
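
As a standalone illustration of what resolving does numerically, here is a simplified model. It assumes a length is just an absolute part in points plus an em multiple, and that the style context boils down to a font size; the `Length` struct and `resolve` signature below are a sketch, not Typst's real types.

```rust
/// Simplified model of a length: an absolute part plus an em multiple.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Length {
    abs_pt: f64,
    em: f64,
}

impl Length {
    /// Folds the em part into the absolute part for a given font size.
    /// The sign is preserved, so no second measurement is needed for
    /// negative lengths.
    fn resolve(self, font_size_pt: f64) -> Length {
        Length {
            abs_pt: self.abs_pt + self.em * font_size_pt,
            em: 0.0,
        }
    }
}
```

For example, `2pt + 1.5em` at a 10pt font size resolves to `17pt`, and a negative length stays negative, which is exactly the case the draw-and-measure approach needs a second `measure` call for.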

sly pecan
#

You could do measure(stack(dir:ttb, box(width: len), box(width: -len))).width

#

That would give the absolute value of the length I believe

#

But I haven't actually tried...

#

@sturdy sequoia

sturdy sequoia
#

Still I think having a resolve function makes a lot of sense @sly pecan

sturdy sequoia
# sly pecan Sure

Although I guess with the whole context thing it might get easier

left night
#

But since it will all be breaking anyway, there's probably not much harm in it.

glad urchin
#

Also the second measure technically only happens with 0pt measurements anyway

#

Well, <= 0pt
So it should be relatively rare

#

But still there's no way to convert em to pt otherwise atm

#

It could be made a bit more efficient by just checking em and doing sign and stuff yeah, but I'd have to restrict that to Typst 0.11.0 since that's what we can currently "detect" inside tablex code

#

Either way we can discuss this more later but the main problems are the technical limitation of using measure and version compatibility between different approaches

glad urchin
sturdy sequoia
glad urchin
sturdy sequoia
glad urchin
#

anyway np i made my own sign function

#

it's glorious

#

(from stackoverflow of course :D)

feral imp
glad urchin
#

@left night do you have any opinions on creating a built-in calc.sign? ๐Ÿ‘€

left night
#

No opinion

glad urchin
#

oh

left night
#

I'm not a huge fan of calc overall though

#

I wish it were methods on a number type

#

That can also be called as num.func

glad urchin
#

honestly, i have thought the same thing multiple times lol

#

but ig we'd still have to discuss that and consider all viewpoints and talk to the community and laziness is superior

#

😂

#

maybe such a larger-scale change can come with the type rework in the future

untold turret
#

having both (int/float).sign sounds great

left night
#

I'm not sure whether this would make sense with distinct int and float types though

glad urchin
#

well, we'd probably have to implement the same methods on both somehow

#

but personally I'm in favor of keeping distinct types, using int at least you have no chance of getting precision errors or stuff like that. seems fair enough to me

#

not to mention that there are indeed situations where the distinction is important (e.g. some-array.at(5.5) would error)

#

but thats a bit off-topic for this thread so we can probably discuss this further elsewhere if needed

#

:p

left night
#

and e.g. float.sqrt would mostly work, but float.min would yield a different type

#

maybe it's not a big problem

glad urchin
#

well yea, float.min would have to return some enum

#

i guess in the end there is some value to calc for those multi-type things. cuz I mean, .min could also make sense for length types for example

#

so it's a bit hard to decide on the best way to proceed here

glad urchin
feral imp
sturdy sequoia
# glad urchin it's glorious
GitHub

Adds the .resolve(styles) method to length to allow retrieving a length as absolute (without the em part) when in the right context.
This changes fixes packages like tablex (https://github.com/PgBi...

GitHub

Closes #3113
I went for the first approach suggested by @PgBiel of specializing calc.sign to work differently whether it's a float or an integer. This can of course be trivially changed.

glad urchin
#

Thanks 👌

#

I think length.resolve(styles) would probably just become length.resolve() or similar with the context proposal so that's cool

sturdy sequoia
untold turret
#

@sturdy sequoia when I upgrade to new comemo, the performance degrades a lot on small documents.

// baseline (before)
https://github.com/Myriad-Dreamin/typst.ts/commit/71fccd6814be2a1daeea01f23b2567c9fbf5a484
typst_ts_bench_lowering  fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ lower_cached          293.8 µs      │ 1.493 ms      │ 339.6 µs      │ 395.9 µs      │ 100     │ 100
├─ lower_incr            10.73 ms      │ 17.19 ms      │ 11.66 ms      │ 11.89 ms      │ 100     │ 100
├─ lower_the_thesis      128.6 ms      │ 280.2 ms      │ 143.8 ms      │ 152 ms        │ 100     │ 100
╰─ lower_uncached        1.097 ms      │ 2.142 ms      │ 1.214 ms      │ 1.263 ms      │ 100     │ 100

// with threaded comemo (after)
https://github.com/Myriad-Dreamin/typst.ts/commit/52ae23bd8003c73650956148c7a12a29f32f9fc9
typst_ts_bench_lowering  fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ lower_cached          326.9 µs      │ 3.153 ms      │ 419.3 µs      │ 460.9 µs      │ 100     │ 100
├─ lower_incr            23.93 ms      │ 33.19 ms      │ 25.98 ms      │ 26.56 ms      │ 100     │ 100
├─ lower_the_thesis      127.7 ms      │ 149 ms        │ 134.9 ms      │ 135.6 ms      │ 100     │ 100
╰─ lower_uncached        9.389 ms      │ 18.36 ms      │ 10.07 ms      │ 10.48 ms      │ 100     │ 100

Of 93 seconds of bench time, 60 seconds are spent on mutexes/atomics.
😨 all of the top 10 most time-consuming functions are mutex functions.

#

But I'm tired of fixing the thousands of lines of code broken by the latest typst, so I may take a rest and come back later to explore how to fix the performance degradation.

sturdy sequoia
untold turret
#

I'm on a laptop windows amd64

sturdy sequoia
#

Some degradation was to be expected, but that much contention isn't normal 🤔

#

There just isn't anything multithreaded enough

#

what export target?

untold turret
sturdy sequoia
#

because only PNG and SVG should ever cause contention

#

makes sense

untold turret
#

I have tried adding rayon to parallelize it at the page level or per frame. The result from VTune shows that it doesn't help

#

But I should write a cache manually to avoid using threaded comemo for my export with the latest typst, to check which parts of the code cause the performance degradation in the end

quaint blaze
#

What's going on with performance stuff rn?

sturdy sequoia
#

I have been on a bit of a break

quaint blaze
#

Ah

sturdy sequoia
#

Is it crazy that changing a SINGLE .unwrap() call to an unwrap_unchecked call saves 6% of the total execution time of typst????

glad urchin
#

😂

untold turret
#

which check did you skip?

sly pecan
#

I don't know rust, but unchecked sounds dangerous!

glad urchin
#

Me when the Typst produces UB and deletes my entire hard disk

#

Noooooo

sturdy sequoia
#

Well to be fair, we check the condition right before calling unwrap

lament fulcrum
#

means it's some path where the compiler can't statically analyze or optimize the unwrap i suppose

sturdy sequoia
#
fn make_mut(&mut self) -> &mut dyn NativeElement {
    let arc = &mut self.0;
    if Arc::strong_count(arc) > 1 || Arc::weak_count(arc) > 0 {
        *arc = arc.dyn_clone();
    }

    unsafe {
        // Safety: We ensured the content is not shared.
        Arc::get_mut(arc).unwrap_unchecked()
    }
}
sly pecan
sturdy sequoia
#

so no matter what, we know that we can Arc::get_mut safely 🤔

sly pecan
#

dumb question: can there be some race condition where it changes from another thread between the check and the unwrap?

sturdy sequoia
#

I mean I don't want to introduce unsafe

#

I just thought it was funny

sly pecan
#

further dumb question: can you unwrap without the check beforehand, and instead possibly handle the error after instead?

#

if unwrap does the check anyway

sturdy sequoia
sly pecan
#

shouldn't the compiler be smart enough to optimize away the check if it's possible?

sturdy sequoia
#

I don't know, I'm just struggling to find new avenues for optimization 😦

sly pecan
#

Maybe it's mostly fast enough? 🙂

left night
# sly pecan dumb question: can there be some race condition where it changes from another th...

for another thread to do stuff, it needs to have a ref to the arc itself. in that case the strong count would be > 1. so after the first part of the if condition, it's clear that we are the only one with a strong ref. then we check for weak refs (a weak ref can't introduce a strong one) and there are also none. I'd have to ponder a bit more to be really sure with the weak refs, but generally this kind of check is safe.

#

and for 6% perf across the board, it would honestly be worth it. for 6% in some specific doc, maybe not.

sturdy sequoia
#

but it could happen that a weak reference gets upgraded between the time you checked the strong ref count and the weak

#

that's why Arc::is_unique does some magic stuff to prevent this scenario, unfortunately it's a private method
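
One way to sidestep the race entirely is to let `Arc::get_mut` perform the uniqueness check itself and only clone when it says no. The sketch below assumes `T: Clone`, unlike the `dyn NativeElement` with `dyn_clone` in the snippet above, and `make_unique` is a hypothetical helper name. For `Clone` types, std's `Arc::make_mut` already provides clone-on-write behavior.

```rust
use std::sync::Arc;

/// Returns a mutable reference to the contents, cloning them into a fresh
/// `Arc` first if any other strong or weak reference exists.
fn make_unique<T: Clone>(arc: &mut Arc<T>) -> &mut T {
    // `get_mut` checks strong and weak counts in one synchronized step,
    // so there is no window between two separate count reads.
    if Arc::get_mut(arc).is_none() {
        let fresh = Arc::new((**arc).clone());
        *arc = fresh;
    }
    // Either it was unique already or we just replaced it with a fresh,
    // unshared allocation, so this cannot fail.
    Arc::get_mut(arc).expect("Arc is unique at this point")
}
```

This still pays for the check inside the final `get_mut`, so it does not recover the speedup that `unwrap_unchecked` gave; it only shows a shape that needs no `unsafe`.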

left night
#

ah okay, yeah as said with weak I'm not 100% sure how it works

#

but with just strong, the reasoning works

sturdy sequoia
#

Because I feel that cloning (and the many atomic operations that come with it) is likely still a bottleneck

left night
#

It clones so much because it sets the guard field

sturdy sequoia
left night
#

I am (right now) working on extracting the guard, label, location etc. from the individual elements into a generic Packed<T>

#

This is necessary to be able to merge Content and Value

sturdy sequoia
#

Which is why I was thinking that if we could extract the guards it could work out really well

sturdy sequoia
left night
#

Because e.g. a Color that ends up in the doc also needs to be able to have guards, labels, locations

sturdy sequoia
#

Ideally I think Content should generally be immutable with the exception of synthesized fields

left night
#

And an int

#

show float etc

sturdy sequoia
#

ah right

left night
#

I'm making good progress on this and will likely open a PR early next week

tight glade
#

That's exciting 🔥

sturdy sequoia
#

I wonder how it will improve perfs 🤔

#

Or degrade them angryeyes

left night
#

For now, I've kept everything in a single Arc (basically Content(Arc<Inner<dyn NativeElement>>) where Inner holds the stuff). I just want to focus on extracting it and keeping all tests green. Then, once Value and Content are merged, I want to optimize the representation a bit.

#

E.g. an int shouldn't be stored as Arc<dyn NativeType>. That's horrible.

sturdy sequoia
#

Ok, I think that's a good first step

left night
#

(The built-in enum will go out the window.)

tight glade
#

Not the built-in enum! I rely on it for typstfmt! /Jk

left night
#

And I want to make Packed<T> coercible into Shared<T> (just a shared ref without the metadata). Then we can use Shared<Gradient> in the output and don't have to deal with Arcs internally for every type to keep it small and cheap to clone.

#

Shared might just be Arc. Not sure. Perhaps also a new nice type that saves a lazily initialized hash alongside the ref count. That'd be neat.
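
A rough sketch of the "lazily initialized hash alongside the ref count" idea, using only std. Everything here is an assumption for illustration: the `Shared` and `Inner` layouts, the method names, and the use of `DefaultHasher` are not taken from Typst, and the header-in-front-of-data layout is not attempted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::ops::Deref;
use std::sync::{Arc, OnceLock};

/// Heap payload: the value plus a hash slot that is filled on first use.
struct Inner<T> {
    hash: OnceLock<u64>,
    value: T,
}

/// Hypothetical cheap-to-clone shared value with a lazily cached hash.
struct Shared<T>(Arc<Inner<T>>);

impl<T> Clone for Shared<T> {
    /// Cloning only bumps the reference count; the cached hash is shared.
    fn clone(&self) -> Self {
        Shared(Arc::clone(&self.0))
    }
}

impl<T: Hash> Shared<T> {
    fn new(value: T) -> Self {
        Shared(Arc::new(Inner { hash: OnceLock::new(), value }))
    }

    /// Computes the hash at most once per allocation, then reuses it.
    fn cached_hash(&self) -> u64 {
        *self.0.hash.get_or_init(|| {
            let mut hasher = DefaultHasher::new();
            self.0.value.hash(&mut hasher);
            hasher.finish()
        })
    }

    /// Whether the hash has been computed yet (for demonstration).
    fn is_hashed(&self) -> bool {
        self.0.hash.get().is_some()
    }
}

impl<T> Deref for Shared<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.0.value
    }
}
```

Because the hash lives in the shared allocation, every clone benefits once any one of them has computed it.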

left night
tight glade
#

/joke

sturdy sequoia
#

like the one from triomphe?

left night
#

Probably at least

#

I would probably also write it in a way where the Arc pointer points directly to the data and the header is in front of it. That way Deref is a no-op.

#

EcoVec does that

lunar kettle
left night
lunar kettle
#

that's awesome 😄

#

looking forward to that!

left night
#

also big win for datetime

#

because language aware

lunar kettle
#

once there are get rules you could then do something like get the current language and format the float depending on that right?

left night
#

yes

lunar kettle
#

awesome 😄

left night
#

as long as you don't format the float into a string eagerly

#

it basically needs to stay a float and end up in the content tree

lunar kettle
#

hmm how would you format it without turning it into a string though?

left night
lunar kettle
#

ahh gotcha!

crystal girder
#

absolutely, yes

sturdy sequoia
#

@left night I've done some preliminary testing on your PR and here are some notes

#
  • While compiles (cold & hot) are pretty unaffected (~2% slower),
#
  • Separate arcs for the metadata and dyn trait does give some gains, especially in incremental
#
  • Using a box is worse
#
  • inlining the value all into the Value is actually not nearly as bad as one might expect, being even faster in some incremental tests!
#
  • Overall, I like the change, and I think it opens the door to better performance and API (Value = Content rework) in the future
#

What I am mostly curious about is whether we can use this to make Content<T>, where T defaults to dyn Bounds, such that a lot of elements could hold a typed Content<T> instead of a plain Content (e.g. a Content<RawLine> for a RawLine), which would avoid needing to reallocate.

#

I also think we could try to have the dyn Bounds be Prehashed to gain a bit of performance and remove all other prehashed

left night
left night
sturdy sequoia
sturdy sequoia
sturdy sequoia
left night
sturdy sequoia
#

Right

#

you're right

left night
#

Which is nice because we can reinterpret a &Content as a &Packed<T>

#

I guess Content<T> might also have worked instead of Packed<T> with #[repr(transparent)] but what I like is that it doesn't depend on any Rust type system coercion foo.

#

It leaves flexibility for how things are implemented.

sturdy sequoia
#

Yes you're right

left night
#

I have a half-finished blog post with a bunch of ideas on how to implement an efficient Value encoding that also supports user-defined types. Unfortunately, it is half finished :(