#Performance


sturdy sequoia
left night
#

it's in a layout_grid in the last iteration

#

very interesting: since locate() provides context, I removed all uses of loc and replaced them with here() etc. and it's still fast

#

but flipping the switch and replacing locate with context makes it slow

sturdy sequoia
#

yeah I know, I tried it too!

#

It's like something in the eval of an ast::Context is slow maybe?

#

but that doesn't make sense

left night
#

it might be that the functions produced by context are harder to disambiguate for the locator or something like that

sturdy sequoia
#

Interestingly, with the VM, both techniques take exactly the same time (which I would expect tbh)

left night
#

yeah, probably you didn't port the minor difference that creates the problem

sturdy sequoia
#

Maybe it's the CaptureVisitor?

#

it has a separate capturer

#

no, doesn't look like that would explain it

sturdy sequoia
#

Ok, I found an awesome heuristic for whether to enable caching with the VM or not: the number of registers it uses

#

works surprisingly well

#

@left night I just had a realization: the VM isn't slower in incremental, it looks like multithreading has slowed down incremental even on main thinkies

#

Maybe the VM isn't that bad after all

feral imp
#

What a twist.

sturdy sequoia
#

maybe we should be clever about when we do multithreading: if the page count is low we can maybe do it single threaded to avoid the costs of syncing?

sturdy sequoia
#

@feral imp Mind you, it's not perfect and there are definitely areas of improvement

#

compile times are still too slow imo

slim sequoia
sly pecan
#

Sunk cost fallacy is real

slim sequoia
#

👍

#

(but also it was obviously said in jest)

sturdy sequoia
#

Yeah okay, but specifics here 😄

#

I can't optimize everything all at once

sly pecan
#

I'm just trying to save you from sinking more time into the vm only to realize it's not worth it again 😂

sturdy sequoia
#

but I feel like there's a gotcha somewhere

#

I mean it used to be 5x faster and now it's only 2x faster

#

so clearly I made something slooooooooooooooooooooooow

left night
sturdy sequoia
# left night what difference have you observed?

on main, from right before mt it was ~300ms for my standard test, and went to ~500ms afterwards (likely due to overhead of mt), and the VM takes the same ~500ms incremental due to re-compiling

#

mt = multithreading but I am too lazy to write it out

sturdy sequoia
#

Well no, this ain't it chief

left night
#

only page runs are multi-threaded, so if just one page run changes -> no multithreading

sturdy sequoia
sturdy sequoia
left night
sturdy sequoia
#

I would need to do a bisect on a longer period then

#

I'll try and do that once I'm done working

#

'cause it is slower than it used to be

#

(or I am going insane)

sly pecan
#

Platform dependent?

#

Bottleneck may shift

left night
left night
#

There are a few threads spawned that do basically no work (fully memoized) and one thread does the work of the changed page run

sturdy sequoia
#

but I don't know which commit before

#

so I need to do a proper bisect between v0.10.0 and main

#

which won't be fun 😭

sturdy sequoia
#

and I am on Windows

#

which is yucky

sturdy sequoia
#

Almost 50% of the total time on my thesis is eval

#

🧐

#

(this is the wonder of the Intel VTune oneapi)

#

And guess where most of that is spent

#

||you guessed it, it goes in the hashing hole||

#

The fact that actually running code takes 23ms is worrying frankly (out of 4s)

left night
sturdy sequoia
#

Mind you, there are other calls it does that take longer, but overall, hashing accounts for the vast majority (~80%) of time spent. This was measured using hardware-based counters, so I would expect it to be somewhat accurate

sly pecan
sturdy sequoia
#

or 1-4

#

I would argue there's not much leeway in changing that 😐

#

It's mostly malloc and caching stuff

#

either Hasher::write or stuff in comemo

#

clearly we can do better

sly pecan
sturdy sequoia
#

If we could reduce it to 64-bit maybe it would be faster but I don't know how much, and I am not sure it would be worth it even

sly pecan
sly pecan
sturdy sequoia
#

The problem is also quality, we need a very high quality hash

#

although perhaps having 128-bits decreases the need for quality somewhat

#

there's also gxhash, but I tested it and saw no performance improvements

#

although it has had a lot of commits since then so perhaps it has improved

sly pecan
sturdy sequoia
#

Like in theory it's king

sly pecan
sturdy sequoia
#

I should also mention: they still don't have a fallback version of the algorithm

#

meaning it only works on ARM & x64

sly pecan
sly pecan
sturdy sequoia
sly pecan
#

I thought wasm had simd

sturdy sequoia
#

I did open an issue early on

sly pecan
#

I thought you meant xxhash

sturdy sequoia
#

xxhash is in C anyway

#

so it's a no-go

sly pecan
#

Did you see the reddit comment I linked? I think gxhash is a non-starter

sturdy sequoia
sly pecan
sturdy sequoia
#

I'll give it a whirl 😉

sly pecan
#

But it seems promising

sturdy sequoia
#

I'm testing it, we'll know in a couple of minutes

#

@sly pecan it's slower, by quite a margin in fact

#

likely because it handles smaller inputs less well than SipHash 1-3

#

for the thesis: it's 0.2s slower on average compared to the VM, for the raytracer: it's 4s slower so 19s compared to 15s with SipHash 1-3

#

at least on my hardware *

sly pecan
sturdy sequoia
#

I mean there aren't many feature flags?

#

I enabled xxh3 for 128-bit support and that's it

#

and doing the whole -C target_cpu=native is cheating imo

#

since we can't distribute that

sly pecan
sturdy sequoia
sly pecan
#

Ok

sturdy sequoia
#

I think that a good way of determining whether we need memoization would be to have a .len() method on Value that gives us an idea of the size of the input, that way if we have large arrays then we can check those, if we don't then we skip memoization, etc.

#

maybe more something like .size-hint()
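
A rough sketch of the size-hint idea, assuming a toy `Value` enum (not Typst's real type) and an arbitrary threshold. The names `size_hint`, `should_memoize` and `MEMOIZE_THRESHOLD` are hypothetical and only illustrate the "large inputs get checked, small inputs skip memoization" rule described above:

```rust
/// Toy stand-in for a dynamic value; not Typst's actual `Value` type.
#[derive(Debug, Clone)]
pub enum Value {
    Int(i64),
    Str(String),
    Array(Vec<Value>),
}

impl Value {
    /// Cheap, approximate measure of how much data hashing would touch.
    /// Only looks one level deep, so computing the hint stays O(1).
    pub fn size_hint(&self) -> usize {
        match self {
            Value::Int(_) => 1,
            Value::Str(s) => s.len(),
            Value::Array(items) => items.len(),
        }
    }
}

/// Hypothetical cut-off; a real value would have to come from measurements.
pub const MEMOIZE_THRESHOLD: usize = 64;

/// Decide whether a call is worth memoizing based on its arguments:
/// large inputs go through the cache, small ones skip it.
pub fn should_memoize(args: &[Value]) -> bool {
    let total: usize = args.iter().map(Value::size_hint).sum();
    total >= MEMOIZE_THRESHOLD
}
```

With these assumptions, a call taking a 100-element array would be memoized, while a call taking an integer and a three-byte string would not.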

sly pecan
#

In the documentation

sturdy sequoia
#

ah, it does look like it uses different algos then

#

using the Xx3 struct will make it streaming, using the xx3() function will make it one-shot

#

no idea if that has an impact

sly pecan
sturdy sequoia
#

To be fair, we could have a "launcher" that selects a typst version with or w/o AVX for better performance

#

I tested, using target_cpu makes zero difference

#

🤷‍♂️

stone pilot
#

No, sorry, the last changes were to the sorting algorithms, not hashing.

sturdy sequoia
#

Yes indeed, mind you we do perform some sorting and the new algorithm will probably perform slightly better

cunning wadi
#

This isn't really an option for Typst though, because wyhash (not sure about their modification) only has 62 bits of collision resistance

sly pecan
#

How big a contribution does shaping have towards compilation? Are we talking minuscule?

left night
slim sequoia
#

Optimized linebreaks can really suck up performance

left night
slim sequoia
#

❤️

left night
#

Just had an idea recently on how to optimize the optimization

#

And it's already mostly implemented, just needs polish

sturdy sequoia
sturdy sequoia
glad urchin
#

it's optimizations all the way down

stone pilot
#

It looks like it's also 64-bit. Probably the same wyhash one used for the sorting improvements

cunning wadi
#

and it would honestly surprise me if it was because the sort function only takes Ord

stone pilot
stone pilot
#

The Fxhash implementation came from that crate as that's the one used in the compiler.

#

I find it amazing that Typst needs higher collision resistance than the compiler?

untold turret
#

comemo needs perfect hashing as when there is a collision, typst may not (re)compute something and cause a wrong compilation.
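
To put rough numbers on the 64-bit versus 128-bit question, here is a minimal sketch of the birthday bound. It assumes an ideal hash (uniformly random outputs), which real hash functions only approximate, and the function name is made up for this example:

```rust
/// Approximate probability that at least one pair among `n` distinct inputs
/// collides under an ideal `bits`-bit hash (birthday bound).
/// The approximation is only meaningful while the result is small.
pub fn collision_probability(n: f64, bits: i32) -> f64 {
    // Number of unordered pairs, each colliding with probability 2^-bits.
    let pairs = n * (n - 1.0) / 2.0;
    (pairs / 2f64.powi(bits)).min(1.0)
}
```

Under these assumptions, a billion hashed values give roughly a 2.7% chance of at least one collision at 64 bits, versus about 1.5e-21 at 128 bits. Since a collision here means silently reusing a wrong cached result, that gap is the argument for the wider hash.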

cunning wadi
stone pilot
#

Yes, I first misremembered the sorting changes as related to hashing improvements, but those are separate improvements. Sorry for the confusion!

molten kayak
atomic violet
#

I just had an idea. If hash slow, what if typst no hash typst eq?

There are a few different types of objects I'd say: the Copy ones which are usually O(1) comparable, the Rc ones (any reference counted objects: Arc, Eco*, ...) which are hashable, and the comemo tracked objects. I think the main issue is the second type (most typst values).

Consider PicoStr. These are interned strings managed by a specialized interner. Can comemo be the interner for Rc-type objects? When a function is called, comemo hashes the arguments, then looks up the result, and returns it. In a sense, comemo "manages" the return value (it won't be dropped unless comemo decides to evict the corresponding call). What if comemo applied the same interning as with PicoStr, but for all Rc-type hashable returned objects? Actually, what if comemo managed input objects too?

Consider a CRc type (comemo reference counted) which is Arc but with one extra atomic bit indicating whether it is a canonical instance of the object. When calling a memoized function comemo would check if the bit of CRc argument is set, and if it is not, lookup the canonical instance of the object by hashing it and looking it up. It will then only use the location of that canonical instance in memory, instead of the hash. Then it will return the same canonical instance of the return value, if applicable.

The reason why this optimization might be good (if possible), is because now arguments are perfectly disambiguated by their memory locations. You no longer need high quality hash functions for interning (because Eq-ing objects one time per each instancing is acceptable), and you no longer need high quality hash functions for looking up the cache too (because all arguments are basically numbers now, and you can Eq arguments too). And it's not like you do much more work anyway: you hash everything before interning, but comemo would do that anyway. Although it's hard to estimate the memory usage.
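
A minimal sketch of the canonical-instance idea, using plain `Arc` and standard collections. `Interner` and `PtrMemo` are hypothetical names and this is not how comemo works today; it only shows interning once and then keying the cache by address:

```rust
use std::collections::{HashMap, HashSet};
use std::hash::Hash;
use std::sync::Arc;

/// Keeps one canonical `Arc<T>` per distinct value.
pub struct Interner<T: Hash + Eq> {
    canonical: HashSet<Arc<T>>,
}

impl<T: Hash + Eq> Interner<T> {
    pub fn new() -> Self {
        Self { canonical: HashSet::new() }
    }

    /// Hashes and compares `value` once, then returns the canonical instance.
    pub fn intern(&mut self, value: T) -> Arc<T> {
        if let Some(existing) = self.canonical.get(&value) {
            return Arc::clone(existing);
        }
        let arc = Arc::new(value);
        self.canonical.insert(Arc::clone(&arc));
        arc
    }
}

/// Memoizes a function using the address of the canonical argument as key.
/// Only sound while the interner keeps every canonical `Arc` alive, so that
/// an address is never reused for a different value.
pub struct PtrMemo<R: Clone> {
    results: HashMap<usize, R>,
    pub misses: usize,
}

impl<R: Clone> PtrMemo<R> {
    pub fn new() -> Self {
        Self { results: HashMap::new(), misses: 0 }
    }

    pub fn call<T>(&mut self, arg: &Arc<T>, f: impl FnOnce(&T) -> R) -> R {
        // The key is a plain integer: no hashing of the argument's contents.
        let key = Arc::as_ptr(arg) as usize;
        if let Some(hit) = self.results.get(&key) {
            return hit.clone();
        }
        self.misses += 1;
        let out = f(&**arg);
        self.results.insert(key, out.clone());
        out
    }
}
```

Interning two equal vectors yields the same `Arc`, so the second `call` with the same function is a cache hit without rehashing the contents. The open question from the message above, memory usage, is visible here too: the interner never drops anything.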

sly pecan
sly pecan
#

I'm too stupid to understand

atomic violet
#

I think it can be, but it's probably extra work with little to no benefit. If the total size of the object is 1 kilobyte, that 1 kilobyte must be crunched through either way: be it hash or eq. And that size cannot be determined at compile time, because most objects are built from collections (which have dynamic size, obviously). This leaves us with simple structs, which are usually Copy. That's why I kind of distinguish only between Copy and Rc-types, and not based on size.

atomic violet
#

In this case I am confused, I don't see how it will help memory usage 🤔

sly pecan
#

I should stop talking 😂

atomic violet
#

Oh, I get it now. Yeah, it might be a good idea, but a small counter argument is: if you have a big object, you probably haven't built it yourself. It was probably returned by a function, and this function is memoized anyway. So it probably won't help much, but one can never know before one measures so 🤷‍♂️

left night
sly pecan
#

@sturdy sequoia The Thesis

left night
#

I tested it with an earlier version and it was a few percents I think

sly pecan
#

I didn't think so, but it should have some effect

sly pecan
left night
#

I think it will be most impactful with book projects

#

No cetz, no crazy stuff, just tons of paragraphs.

sly pecan
#

Is it possible to use different strategies depending on paragraph length?

left night
#

You mean a different way to bound the paragraph?

sly pecan
#

yeah

left night
#

maybe. could also be that there's a smarter way to apply the bound. that's just what I came up with.

sly pecan
#

I guess there are diminishing returns. Few paragraphs have thousands of words anyway

left night
#

yeah, I'm quite happy that it's this way around, rather than the bound only kicking in late

#

I also enjoy a lot that something that feels like algorithms class actually brings a lot of real-world performance here.

#

I did a course on bounds for graph vertex cover optimization in university, which was somewhat similar.

left night
sly pecan
#

"Unfortunately, this approach of computing widths falls down with proper text shaping"

#

What does luahbtex do when using harfbuzz?

sturdy sequoia
left night
sturdy sequoia
#

But that's okay

sly pecan
#

sounds like latex may cheat a bit?

#

would probably be easiest just to ask the latex and/or harfbuzz devs

#

Another consideration is microtypographical features, which will reduce the need for hyphenation

feral imp
#

That "oi wiki" document would probably show a delta.

sly pecan
#

@untold turret

feral imp
onyx furnace
#

yes. but i cannot find the zip file now😂 i can regenerate one if it's needed

sturdy sequoia
#

😭

#

The version I have is too old to even compile on main @onyx furnace 😂

left night
feral imp
sturdy sequoia
#

But but but.....

left night
sturdy sequoia
#

are there many docs that are bottlenecked by paragraph layout?

left night
#

I would wager to guess that most documents that aren't bottlenecked by something else and use justification are bottlenecked by this.

#

Most prominently of course books

#

And test benchmarks of people coming from LaTeX ^^

sturdy sequoia
#

'cause I have no idea at this point 😄

#

by images & codly

#

I guess I shouldn't be surprised 😄

sly pecan
sturdy sequoia
#

I mean the multithreading helps

#

but on each thread it's still fairly slow

#

not much we can do there I am afraid

sturdy sequoia
sly pecan
sturdy sequoia
#

I would love it if we had impeccable SVG support

#

it would be so good

sly pecan
#

No one needs more than 640x480

sturdy sequoia
sly pecan
lunar kettle
sturdy sequoia
lunar kettle
#

indeed 😦

sturdy sequoia
#

they're like "we only want to support the web browser"

#

Like just add an option "inline everything" in your goddamn software ffs

lunar kettle
#

@sly pecan I love how they marked your one comment in a draw io issue as a duplicate 😂

lunar kettle
#

Anyway off topic

sturdy sequoia
#

I wish we had a draw.io like tool that generates Typst code

#

(side project anyone?)

left night
lunar kettle
#

@sturdy sequoia btw does vtune work with programs that only run for a few ms?

left night
#

all the flows and pads and grids

sturdy sequoia
#

you will get barely any samples at that point

sturdy sequoia
left night
#

but maybe actually no. most of those layouts end up in style or context

sturdy sequoia
#

There is one page that takes 450ms per iteration all on its own and it's a big table 😄

left night
sturdy sequoia
#

right?

left night
#

not necessarily

#

if you want to be sure, you'd still need to decode the image, but not encode

feral imp
left night
#

XeTeX splits text at hyphenation points and then shapes only the parts between the hyphenation points. This ignores ligatures/kerning across discretionaries. Then these widths are used for linebreaking. After linebreaking, the horizontal lists are reshaped, this time taking kerning/ligatures across discretionaries into account.

#

so, it's sort of like if we used the approximate result directly instead of as a bound

#

which would speed up things much more. but it's a hack.

sturdy sequoia
left night
#

agreed

sly pecan
#

The alternative would be more granular control than simple and optimized I guess

left night
#

which, from my reading, XeTeX would also suffer from

sly pecan
#

How overfull are we talking? Many sins can be hidden by adjusting spacing

left night
#

typically not much, but fonts can do random shit

keen scroll
#

Just blame it on the font if that happens 😈

left night
sturdy sequoia
#

so you know... we're right 😎

left night
#

pdfLaTeX is a matter of how you view it:

#

I think it's still faster for a single plain compilation with just a lot of text (not 100% sure), but you need to run it multiple times for outline etc. so I guess a single run is more akin to a typst watch cycle

sly pecan
#

What's the deal with this jump? The slope seems to change after too

slim sequoia
#

The post suggested memory bottlenecking, perhaps that's where the bottleneck becomes apparent in the new algorithm

sly pecan
slim sequoia
#

I'm not sufficiently technologically literate to know what that means or if it's the case

#

I'm a chemist simpleton 😛

sly pecan
#

Upon further reflection, that doesn't make sense either, because the y axis isn't time, it's the number of lines built

#

It's just strange to me that there's such a distinct jump

left night
#

I assume that this particular bump is specific to lorem ipsum with default settings

sly pecan
#

That makes sense

sturdy sequoia
#

L1/L2 have basically the same latency ||gross oversimplification|| most likely when it can't rely on L3 anymore and has to go to main mem

sly pecan
sturdy sequoia
sly pecan
#

But it's not relevant here

sturdy sequoia
#

hmm maybe

sly pecan
sturdy sequoia
#

well

#

Ich bin dumb

sly pecan
sturdy sequoia
left night
#

@sturdy sequoia does perfetto also always initialize fully zoomed in for you since a while? I think originally it showed the full trace by default but now I need to drag the slicer all the way to the right every time. I can't figure out what the cause is.

sturdy sequoia
#

@left night yes, it's really gosh darn annoying

left night
#

they must've shipped some bug

#

cause I deleted all local data

sturdy sequoia
sturdy sequoia
#

@left night interestingly, on my thesis, I resized (using a script 😅) all of the images and it had zero impact on compile times! (probably thanks to parallel compilation)

#

But, removing the glossary calls in my code blocks saves almost 0.4s of compile time (1.8s -> 1.4s cold)

#

Also makes incremental quite a bit faster 🎉

#

Although it does save almost 1GB of RAM when compiling lol

left night
sturdy sequoia
# left night time to write a more efficient glossary

Yes but this all makes me think that:

  • There should be a basic "image" resizer in the webapp for convenience when uploading like "oh, you uploaded an image and it's quite large, wanna resize it?"
  • We should have a quick guide on how to make fast docs
  • And yes, we need a more efficient built-in glossary 😄
#

BTW @left night could I ask you (and it's likely a big ask) but to compare the VM and main in the webapp? (since obviously I can't do that), I am curious if the VM helps in WASM at all or not 🙂

#

The big advantage of the VM (in theory) is also that it could be JIT'ed in WASM

left night
sturdy sequoia
sly pecan
#

@sturdy sequoia did the new thing affect The Thesis™️?

sturdy sequoia
#

I shall try it then

slim sequoia
#

wouldn't also mind knowing how it looks different if at all

sly pecan
#

Apart from bug fixes

slim sequoia
#

in which case dherse whatever you do please don't tell me if it looks different at all

sly pecan
proud sandal
#

must've worsened the performance by quite a bit then

sturdy sequoia
#

Y'all know I work like... a difficult job that you know, requires most of my brain power 😂

proud sandal
#

it's still compiling

sturdy sequoia
#

I just got on my PC :-p

#

And then I was stuck in traffic for an hour

#

because even the f-ing E40 was clogged for some f-ing reason that blows my mind

sly pecan
sturdy sequoia
#

Plus I thought I would get fired yesterday, which obviously didn't happen and I was wayyy over reacting

#

but damn he drained me

sturdy sequoia
# sly pecan Whaaaaa

Yeah, dumb company was mad that I used a colleague's license for their software to try if it was the right fit for my work

#

they got into a bit of a hissy fit while we were on a call (just me and them)

#

had to calm them down

#

and immediately talked to my management who said that they were a pain in the neck and that I reacted correctly

#

I was shooketh for sure

#

I wish, I meant "it"

#

😭

slim sequoia
#

The gods hath spoken

sturdy sequoia
#

almost done, just waiting on main to compile release

#

I also optimized The Thesis™️ itself

#

it compiles in barely 1s now!

slim sequoia
#

But but

sly pecan
slim sequoia
#

How are we meant to say "wow goes vroom" if it's already vroom

#

😠

sturdy sequoia
slim sequoia
#

:3

slim sequoia
#

Someone needs to keep track of compile times in a chart

sly pecan
#

No need for incremental compilation anymore

slim sequoia
#

X axis release, y axis inverse of compile time

sturdy sequoia
#

To be fair, incremental is sloooooooooooooooooooooooooow now

#

it takes 0.4s per iteration

#

which is slow compared to cold compile

slim sequoia
#

I blame a lack of VM

sly pecan
#

It's gotten slower? Or just relative to cold

sturdy sequoia
#

I'll do a custom build without comemo in a few minutes 😂

sturdy sequoia
#

but don't know which commit

sly pecan
glad urchin
sturdy sequoia
#

Note: PDF export on Windows native

#

so to me it does look like decent gains """for free"""

#

Wait no

#

I ran the VM runs with timings

#

DON'T LOOK AT THE RESULTS

slim sequoia
sturdy sequoia
#
branch     cold     inc
main pre:  1.63s   401ms
main post: 1.46s   406ms
vm pre:    1.45s   400ms
vm post:   1.36s   410ms

The VM is so f-ing useless on The Thesis™️ lol

#

Ok, There was only one bad result, it's fixed now 😉

#

note that "main pre" is a bit older than just before this commit afaik so the difference should be a bit lower

sly pecan
#

0.1s isn't bad

sturdy sequoia
sly pecan
#

Weird that incremental is comparatively slow

sturdy sequoia
#

but I don't know when

#

For some reason this citation is by far the slowest

#

taking 25 ms all on its own lol

sly pecan
#

I guess incremental performance is harder to figure out

sly pecan
#

Have you tried the speculative execution branch?

sturdy sequoia
#

you mean in incremental? then yes but I don't remember the results

feral imp
#

Incremental has slowed down... But more stuff happens now.... You are tablex free now? But incremental is slower than it used to be.

😬

sly pecan
sturdy sequoia
sturdy sequoia
#

😎

sly pecan
#

I blame autocorrect

left night
sturdy sequoia
left night
sturdy sequoia
left night
sturdy sequoia
#

maybe it's Windows specific

sturdy sequoia
#

sowwy

left night
#

I mean, I did optimize some stuff, but I didn't expect any real-world gains

feral imp
#

Did you mess up with environment variables and the package system maybe?

left night
#

let me see whether I can reproduce this result

feral imp
#

Does your thesis use any package from the universe? It might be something like that... 🤡

sturdy sequoia
sturdy sequoia
#

You should really only compare with the VM here because the main pre was a slightly older build of main

#

(I can't be arsed to do a new one)

feral imp
left night
#

let me see, just compiling --release and it always takes so long

sturdy sequoia
sturdy sequoia
left night
#

no, if anything it's Preparation::slice

#

cause that is O(n) -> O(1)

sturdy sequoia
#

And perhaps accidentally better caching?

sturdy sequoia
#

@left night I think I actually found a """simple""" way of optimizing eval angrythunk

#

Without doing anything fancy

#

But y'all'll have to wait to know what it is 😈

left night
#

I compared directly the commits immediately before and after the PR

sturdy sequoia
#

another AMD win 💪

left night
#

or maybe it's actually Apple Silicon that dealt well with the crappy old code :p

sharp garnet
sturdy sequoia
#

No, it's Cow, with me it's always Cow

sharp garnet
sturdy sequoia
#

@left night isn't our AST flattened?

sturdy sequoia
left night
#

one might try it, but I do not believe that it will be faster

slim sequoia
sturdy sequoia
#

I love Cows

left night
#

and it might make the code less maintainable

sturdy sequoia
#

the rust kind

sturdy sequoia
left night
#

also it would mean a bit more overhead on incremental reparsing

#

but probably negligible

sturdy sequoia
#

Does parsing actually have a performance impact?

left night
#

not much I think

sturdy sequoia
left night
#

nope

#

cause it happens on demand

sturdy sequoia
#

yeah figures

left night
#

but if imports were parallel then we would

sturdy sequoia
#

Indeed

left night
#

though maybe not

#

depends on the World actually

sturdy sequoia
#

if an import or include returned a Deferred<T> i GUESS

left night
#

cause it returns a parsed Source

sturdy sequoia
#

Sorry caps

#

But indeed I don't think World is easy to Send + Sync?

#

or we'd need a kind of ScopedDeferred<T>

#

using scoped threads
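
A minimal sketch of what the scoped-threads variant could look like, using `std::thread::scope` from the standard library. `ToyWorld` and `parse` are toy stand-ins (not Typst's `World` or parser), and the sketch assumes the world type is `Sync`, which is exactly the open question above:

```rust
use std::thread;

/// Toy stand-in for a world that hands out source text; not Typst's `World`.
pub struct ToyWorld {
    pub files: Vec<(String, String)>,
}

/// Pretend "parse": count the whitespace-separated tokens in a file.
fn parse(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Parse all files in parallel while only borrowing `world`.
/// `thread::scope` joins every spawned thread before it returns, so no
/// `'static` bound or `Arc` is needed, but `ToyWorld` must be `Sync`.
pub fn parse_all(world: &ToyWorld) -> Vec<(String, usize)> {
    thread::scope(|scope| {
        // Spawn everything first so the work actually overlaps.
        let handles: Vec<_> = world
            .files
            .iter()
            .map(|(name, text)| scope.spawn(move || (name.clone(), parse(text))))
            .collect();
        // Join in spawn order, which keeps the results in file order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The borrow of `world` ends when `parse_all` returns, which is roughly the property a `ScopedDeferred<T>` would need to provide.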

sturdy sequoia
left night
sturdy sequoia
#

BTW, I just found a stupidly simple optimization, I'll open a PR for it

#

probably changes nothing in the real world

#

but it's so dumb I can't not open up a PR

feral imp
#

Thesis is the real world (for now)

sturdy sequoia
#

Pretty sure it's my smallest PR by an order of magnitude 😂

left night
sturdy sequoia
#

on very constrained devices it might even make the tiniest of differences 🤷‍♂️

sly pecan
#

@sturdy sequoia how much of the 400 ms is export?

sturdy sequoia
sly pecan
#

Ok

#

Do you know why that citation takes 25 ms?

slim sequoia
glad urchin
#

well

#

that's a lot of linebreaks

#

however i cannot reply with that video unfortunately

#

😂

sly pecan
#

282 million characters is like 150 000 pages

#

What on earth is this person doing

left night
#

if it has ultralong paragraphs, this is actually a case where my latest PR could help because they might be running into quadratic runtime
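
To illustrate where the quadratic behaviour comes from, here is a textbook-style dynamic program for optimal line breaking. It is a sketch under simplifying assumptions (integer word widths, squared slack as cost, no hyphenation or shaping) and is not Typst's algorithm or the PR mentioned above. Without the early `break`, the two nested loops visit every pair of break points in the paragraph:

```rust
/// Minimum-raggedness line breaking via dynamic programming.
/// `widths[i]` is the width of word `i`, `space` the gap between words.
/// Returns the start index of every line, or `None` if some word is too wide.
pub fn break_lines(widths: &[u64], space: u64, line_width: u64) -> Option<Vec<usize>> {
    let n = widths.len();
    // best[j] = minimal cost of laying out words[..j]; best[0] = 0.
    let mut best: Vec<Option<u64>> = vec![None; n + 1];
    let mut prev = vec![0usize; n + 1];
    best[0] = Some(0);

    for end in 1..=n {
        let mut used = 0u64;
        // Try every candidate line `start..end`, growing it backwards.
        for start in (0..end).rev() {
            used += widths[start] + if start + 1 < end { space } else { 0 };
            if used > line_width {
                // Bound: any earlier `start` only makes the line wider, so stop.
                break;
            }
            let Some(before) = best[start] else { continue };
            let slack = line_width - used;
            // The last line is not penalized for being short.
            let cost = before + if end == n { 0 } else { slack * slack };
            if best[end].map_or(true, |b| cost < b) {
                best[end] = Some(cost);
                prev[end] = start;
            }
        }
    }

    // Bail out if no valid layout exists.
    best[n]?;
    // Walk the predecessor links back to recover the line starts.
    let mut starts = Vec::new();
    let mut end = n;
    while end > 0 {
        starts.push(prev[end]);
        end = prev[end];
    }
    starts.reverse();
    Some(starts)
}
```

For word widths `[3, 2, 2, 5]`, a space of 1 and a line width of 6, this returns line starts `[0, 1, 3]`, whereas a greedy first-fit would break as `[0, 2, 3]`. The early `break` caps the inner loop at the number of words that fit on one line, which is the kind of bound that keeps very long paragraphs from going quadratic.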

#

but still a very valid question what the heck they are doing

low sapphire
sly pecan
low sapphire
#

Dherse has insane specs IIRC

left night
glad urchin
#

given they're on a mac, im assuming it's using a ton of swap

sly pecan
glad urchin
#

iirc macos just doesnt care and will fill your whole disk with swap if needed

#

lol

#

so they might be the first person to have typst consume hundreds of gigabytes of memory. or something

#

i already thought my 10000 tables and grids consuming 60 GB of RAM was bad, but then someone in #quick-questions was doing this like for real

and now we have this

#

if anything this shows that the compiler is very resilient 😂

sly pecan
#

Yeah this document is likely on the order of 1 TB of memory usage or more

glad urchin
#

i wouldnt doubt it

#

especially since they said there are footnotes and whatever

#

i wonder if they're using something like AI or whatever to generate such a document

#

cuz theres no way i can think of someone actually typing all that

sly pecan
glad urchin
#

hmmmmmm

#

their profile does suggest they'd do this kind of stuff yeah

#

😂

#

paging @feral imp for an assessment

sly pecan
#

I guess at some point reducing memory usage should be looked into. But this particular one sounds pretty far outside of what would be considered reasonable

feral imp
#

His main research interests are efficient algorithm development, developing automated pipelines for biological data analysis, epigenetics, phylogenetics and, more generally, finding creative solutions for a wide range of bioinformatics challenges.

glad urchin
#

whatever they're cooking right now is no joke

slim sequoia
#

Is the guy trying to print a human genome in hard case binding?

feral imp
#

But people are really into this.. mega docs thing... Just the other day, someone had 250 pages note document........

slim sequoia
#

(which NGL would be pretty cool)

feral imp
glad urchin
#

someone had 8k pages on #quick-questions the other day

#

lol

#

and this one idk

#

this one is just astronomical

#

i have no idea how many pages that would be

slim sequoia
#

My second question is how the fuck did he have the patience to wait 24 hours before thinking perhaps he should ask if that's a normal amount of time

feral imp
#

If you told me.. oh typst would be used to generate mega pages documents a year ago, I would have insulted you REPEATEDLY.

feral imp
glad urchin
#

i assume biological simulations arent usually the cheapest stuff

slim sequoia
feral imp
slim sequoia
#

👀

#

Lots of innuendos in this server today

sly pecan
#

@glad urchin they added some more information. I have to admit I don't really understand what they're talking about.

glad urchin
#

guess our estimates were way off

#

lol

sly pecan
#

They haven't compiled the entire document yet though

#

Anyway do I understand correctly that they have 2.5 million calls to underline ()?

glad urchin
#

i mean

#

i guess using #let would make parsing faster

#

but thats about it

#

memoization would work the same way regardless i think, since the parameters are the same

left night
sturdy sequoia
#

@left night does Args need to contain an EcoVec of items?

#

couldn't it just as well be a Vec?

#

is it cloned often? and does it matter when it is cloned?

#

because I would expect that it doesn't get cloned often

#

Ah I see, Vec is bigger than EcoVec

#

I am assuming that's the motivation here

#

Interestingly, increasing the size of Args does not increase the size of Value
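
One plausible explanation is that another variant already dominates the enum's size. A minimal sketch with standard types only, assuming the compact vector is two words (pointer plus length) and using `Box<[u8]>` as a stand-in, since `EcoVec` is not in the standard library. The enums are hypothetical and merely mimic a small variant sitting next to a larger one:

```rust
use std::mem::size_of;

/// Two-word stand-in for a compact vector (pointer + length).
pub type Compact = Box<[u8]>;
/// Three-word standard vector (pointer + length + capacity).
pub type Wide = Vec<u8>;

/// Toy enums: a small `Args`-like variant next to a larger one that dominates.
pub enum WithCompact {
    Args(Compact),
    Big([u64; 4]),
}

pub enum WithWide {
    Args(Wide),
    Big([u64; 4]),
}

/// Sizes of the two vector types and the two enums, in bytes.
pub fn sizes() -> [usize; 4] {
    [
        size_of::<Compact>(),
        size_of::<Wide>(),
        size_of::<WithCompact>(),
        size_of::<WithWide>(),
    ]
}
```

Swapping the two-word field for a three-word one grows the variant but not the enum, because the enum is sized by its largest variant.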

#

That being said, on my machine it makes no difference

#

:/

left night
sturdy sequoia
sturdy sequoia
#

since Args are more mutably accessed than anything else afaik

left night
#

some strange optimization

sturdy sequoia
left night
#

maybe

#

seems more likely

#

since rust isn't that smart usually

sturdy sequoia
#

IS THAT WHY THE KID TALK ABOUT RIIZ?

#

/s

left night
#

zig?

sturdy sequoia
left night
#

hadn't seen RIIZ before

sturdy sequoia
#

I just invented it 😎

left night
#

^^

sly pecan
sly pecan
#

I'm confused

sturdy sequoia
# sly pecan I'm confused

I think this person assumes that we could just "render page n" and it would be faster than exporting a whole document

#

but it isn't

#

since exporting is actually among the fastest steps

sly pecan
sturdy sequoia
sly pecan
#

I feel like there's a different reason why it doesn't compile

sturdy sequoia
molten kayak
#

Looking at their script, they are generating ~10000 tags (there could be duplicates, but that should be pretty rare), each of which is associated with a paragraph (considering 4 paragraphs a day this would take 6-7 years to write) of 100 words or tags (with tags being ~1/5 of the total). That's roughly 800000 words and 200000 refs in total.
If you bring down the number of tags to 1000 then it takes ~15 seconds (I'm on the latest release, it might be faster on main) and produces 540 pages, which shouldn't be that bad.

sly pecan
molten kayak
# sly pecan When you say tags you mean labels?

The script calls them tags and generates a #tag[a_word] for each of them. I haven't dug into what that function does but I guess it creates a label with that word, because the same words are used as refs with @a_word

sly pecan
#

@sturdy sequoia seems it's introspection then. That's a lot of labels, but I still would expect it to compile

sly pecan
molten kayak
sly pecan
#

Try 2000?

molten kayak
#

Ah I should also mention that those 15 and 5 seconds are with --timings (which I guess might slow down compilation a bit)

#

Well, with 2000 it took 37 seconds to spit out a bunch of errors because the script didn't ensure that labels are unique...

sly pecan
#

Maybe you could reply to their issue?

molten kayak
#

Btw if I modify the script to ensure unique labels it doesn't change much the timings (it takes 40 seconds to compile with 2000 tags)

sturdy sequoia
#

I guess we could, for large numbers of labels, do parallel shenanigans, but that seems overkill

sturdy sequoia
molten kayak
#

I've also tried the version of typst in main and it seems much better. It takes ~5 seconds for 1000 tags (vs 15 seconds for the latest release) and ~2 seconds for the 1000 tags without refs (vs 5 seconds for the latest release). The full 10000 tags version takes 1m30s, which is kinda slow but not that bad for that amount of text. Also seems much better than the previously predicted 2m30s for the latest release (which also assumes linear scaling).

sly pecan
gritty hazel
# glad urchin lol

Thought I'd report back on this one, I'm the dude who had the 8k doc 😄. After making the tweaks you recommended @glad urchin of removing the grids I went from 32GB memory consumption down to 24GB memory consumption all the while the doc has grown to 9k pages! So we're looking good, thanks for the help!

glad urchin
#

Sounds great 👍

sly pecan
gritty hazel
quaint blaze
feral imp
low sapphire
#

Is there no way to limit the memory usage?

#

Less threads? Can you even set that manually?

left night
sturdy sequoia
#

(it needs to be used, right, but I'll try to open a PR this week that uses it)

sly pecan
left night
sturdy sequoia
sturdy sequoia
left night
#

not sure whether it would work on wasm

#

but maybe later, some manual heuristics go a longer way initially I think

sturdy sequoia
#

since it would need to check whether the call is already in the cache?

left night
#

but I now think it wouldn't really work out well

#

cause it would be static per function and e.g. call_closure just is unpredictable in that way. sometimes the function is expensive, sometimes it's cheap.

#

so whenever the opt-out would work statically, we might as well remove the memoize in the source code

sturdy sequoia
#

that says whether that closure is memoizeable, and this boolean is set based on timings?

#

for user-code at least it should help!

left night
#

but it would be a good approximation I think

#

maybe something with a few more bits than a bool

#

so that we can say like "if this was cheap a few times in a row, skip it"

#

but maybe a bool is actually better because it kicks in faster
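
A minimal sketch of the "few more bits than a bool" variant, assuming a small saturating counter per function. `CheapStreak`, `LIMIT` and the method names are made up for illustration and are not part of comemo or Typst:

```rust
/// Counts consecutive cheap calls, saturating at `LIMIT`.
/// Hypothetical helper, not an existing comemo type.
#[derive(Default)]
struct CheapStreak(u8);

impl CheapStreak {
    /// Assumed number of cheap calls in a row before memoization is skipped.
    const LIMIT: u8 = 3;

    /// Records one observed call; a single expensive call resets the streak.
    fn record(&mut self, was_cheap: bool) {
        self.0 = if was_cheap { (self.0 + 1).min(Self::LIMIT) } else { 0 };
    }

    /// True once the function was cheap `LIMIT` times in a row.
    fn skip_memoization(&self) -> bool {
        self.0 >= Self::LIMIT
    }
}
```

With `LIMIT = 1` this degenerates to the plain bool, which is the "kicks in faster" end of the trade-off.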

sturdy sequoia
#

I was thinking of having the concept of a Value::len that gives an approximate length of the items, which gives us a hint towards the hashing cost & running cost of whatever function is being called on it

left night
#

but a function can be very cheap even if the value is big

#

e.g. array.at

sturdy sequoia
#

Yes but the hashing is still there (if it's a closure)!

#

That's the trick, it doesn't remove the expensive bit

#

unless we make a LazyEcoVec that contains the hash on the heap too 😂

left night
#

the bool could skip the hashing

sturdy sequoia
#

Indeed, but my idea with len is that we can check if it's going to be costly to hash

left night
#

overall, if we can somewhat reasonably observe the runtime behaviour, it will be always better than a hand-written heuristic

#

at least that's my theory

left night
sturdy sequoia
#

I agree

sturdy sequoia
#

Perhaps having two bools: one that tells us when len is big is slow, and when it's fast

left night
#

sounds a little over-engineered tbh

sturdy sequoia
#

yeah probably

left night
#

what's the problem with a single bool that says whether it's cheap?

sturdy sequoia
#

Yeah no, it's good

left night
#

we would need to measure to find out whether there are too many cache misses due to it

sturdy sequoia
#

It's fairly easy using a MISS_COUNT and HIT_COUNT, I've actually used this in the past for measuring cache efficiency
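
Roughly what that counting could look like, as a sketch with two global atomics around a toy cache. `lookup` and `hit_rate` are illustrative names and the `HashMap` stands in for the real cache; none of this is comemo's actual API:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

static HIT_COUNT: AtomicUsize = AtomicUsize::new(0);
static MISS_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Looks up `key` in `cache`, computing and storing the value on a miss.
fn lookup(cache: &mut HashMap<u64, u64>, key: u64, compute: impl FnOnce(u64) -> u64) -> u64 {
    if let Some(&value) = cache.get(&key) {
        HIT_COUNT.fetch_add(1, Ordering::Relaxed);
        return value;
    }
    MISS_COUNT.fetch_add(1, Ordering::Relaxed);
    let value = compute(key);
    cache.insert(key, value);
    value
}

/// Fraction of lookups served from the cache so far (0.0 if none happened).
fn hit_rate() -> f64 {
    let hits = HIT_COUNT.load(Ordering::Relaxed) as f64;
    let misses = MISS_COUNT.load(Ordering::Relaxed) as f64;
    if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
}
```

Comparing `hit_rate()` with and without an opt-out heuristic would show whether the heuristic costs too many hits.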

left night
#

wait, are we talking about the same kind of cache?

#

I meant comemo's cache, not the CPU's

sturdy sequoia
#

I meant in comemo

left night
#

ok

#

so we would check enabled = !func.is_cheap()

#

and then in the body unconditionally update is_cheap after the function, correct?

#

idk whether we can measure anything on wasm though
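
A sketch of that flow, assuming wall-clock timing is available. `Func`, the `CHEAP` threshold and the `HashMap` cache are placeholders for illustration, not how comemo or Typst represent functions:

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical threshold below which a call counts as cheap.
const CHEAP: Duration = Duration::from_micros(5);

struct Func {
    body: fn(u64) -> u64,
    /// Result of the last measurement; starts out as "not cheap".
    is_cheap: Cell<bool>,
}

impl Func {
    fn new(body: fn(u64) -> u64) -> Self {
        Func { body, is_cheap: Cell::new(false) }
    }

    fn call(&self, cache: &mut HashMap<u64, u64>, arg: u64) -> u64 {
        // enabled = !func.is_cheap(): cheap functions bypass the cache,
        // which also skips hashing the argument.
        let enabled = !self.is_cheap.get();
        if enabled {
            if let Some(&hit) = cache.get(&arg) {
                return hit;
            }
        }
        let start = Instant::now();
        let out = (self.body)(arg);
        // Unconditionally update the flag after the body has run.
        self.is_cheap.set(start.elapsed() < CHEAP);
        if enabled {
            cache.insert(arg, out);
        }
        out
    }
}
```

Because the flag is refreshed on every real execution, a function that turns expensive re-enables memoization after one slow call.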

sly pecan
left night
#

the 24h one

#

which is still open ...

left night
#

fwiw, the hypermedia systems book goes from 6.5s (0.11.1) to 1.8s (main) on my machine. pretty nice win.

#

it also found a panic in main ... but I fixed it

feral imp
#

Double win.

south apex
stone pilot
#

@pearl sedge very nice! and I didn't know about shiroa, which is also very cool!

sturdy sequoia
#

Ok, I was watching this conference and I just got ideas on how to make the VM heaps faster

slim sequoia
#

ooo

untold turret
#

You could prepare a gift, the vm-3, for laurmaedje, who's on vacation.

sturdy sequoia
#

then a VM-3 based on that

#

Currently the big problems of the VM are:

  • Complexity
  • Non-parallel compile (something I'd like to solve)
  • Inefficient data structures
#

With flattening I fix 3 and reduce 1. Additionally, during flattening, I would look for static paths and launch compilation of those

untold turret
#

what's ast flattening? a sequence or a thinner tree of IR converted from ast?

sly pecan
#

😂

sturdy sequoia
#

usize -> u32 (size reduction) and more cache friendly
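
To make the idea concrete, a toy sketch of a flattened tree: nodes live in one `Vec` and refer to each other by `u32` index instead of by pointer. The `Node` variants are invented for the example and have nothing to do with Typst's real AST:

```rust
/// Index into `Ast::nodes`; 4 bytes instead of an 8-byte pointer or `usize`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeId(u32);

/// Toy expression nodes, only for illustration.
#[derive(Debug)]
enum Node {
    Int(i64),
    Add(NodeId, NodeId),
    Mul(NodeId, NodeId),
}

/// All nodes stored contiguously, children before parents.
#[derive(Default)]
struct Ast {
    nodes: Vec<Node>,
}

impl Ast {
    /// Appends a node and returns its index.
    fn push(&mut self, node: Node) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(node);
        id
    }

    /// Evaluates the subtree rooted at `id` by following indices.
    fn eval(&self, id: NodeId) -> i64 {
        match &self.nodes[id.0 as usize] {
            Node::Int(v) => *v,
            Node::Add(a, b) => self.eval(*a) + self.eval(*b),
            Node::Mul(a, b) => self.eval(*a) * self.eval(*b),
        }
    }
}
```

Walking the tree then touches one contiguous allocation rather than chasing separately allocated nodes.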

untold turret
lost meteor
#

The node struct in typst-syntax is actually meant to be the same structure as in Rowan, and we have an incremental parsing algorithm in reparser.rs, but I'm not sure how optimal it is

sturdy sequoia
#

@left night What about a hotness and depth based system for disabling memoization?

#

If the function is called within a function that is cold (rarely updated), its sub functions could be skipped from memoization

#

essentially trying to memoize at the highest level we can as often as possible

#

If the caller is not hot, we disable memoization

#

(locally right)

#

Unless the function itself is very hot (called often)

left night
#

since we don't initially know which functions are hot, isn't that something that should rather be handled by eviction?

#

if a sub-result isn't reused quickly, it will get evicted

sturdy sequoia
#

Hmm, I suppose yeah

left night
#

and then it will not be computed since higher levels are memoized

sturdy sequoia
#

I guess that's true that eviction will evict small functions quickly-ish

#

But if they were evicted previously, we could skip memoization

left night
#

doesn't help with peak memory usage after first compilation, but after a few seconds it kicks in

left night
#

and then something that was previously cold is suddenly hot

sturdy sequoia
#

That's the tricky bit and the point of keeping track of "hotness"

#

which can be integrated into comemo

left night
#

yeah

sturdy sequoia
#

so it would be something like enabled = hotness > something or not_previously_evicted. The problem is checking whether it was previously evicted (could be at the function level, but it would still require hashing arguments to perform that check intelligently)

#

Other thing we could try: we could evict more quickly if the function is deep in the call stack
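
A rough sketch of the `enabled = hotness > something or not_previously_evicted` check, assuming per-function bookkeeping. `MemoStats`, `HOT` and the methods are hypothetical and not something comemo provides:

```rust
use std::collections::HashSet;

/// Assumed call count above which a function is considered hot.
const HOT: u32 = 8;

/// Hypothetical per-function bookkeeping.
#[derive(Default)]
struct MemoStats {
    /// Number of calls seen so far.
    hotness: u32,
    /// Hashes of argument sets whose cached results were evicted.
    evicted: HashSet<u64>,
}

impl MemoStats {
    fn record_call(&mut self) {
        self.hotness = self.hotness.saturating_add(1);
    }

    fn record_eviction(&mut self, args_hash: u64) {
        self.evicted.insert(args_hash);
    }

    /// Memoize if the function is hot or this call was never evicted.
    fn should_memoize(&self, args_hash: u64) -> bool {
        self.hotness > HOT || !self.evicted.contains(&args_hash)
    }
}
```

Note that `should_memoize` takes an already computed `args_hash`, so this version still pays for hashing the arguments, which is exactly the cost pointed out above.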

sly pecan
#

Have you tested regressions recently @sturdy sequoia ? Curious if all the big layout stuff has had any impact

sturdy sequoia
#

@lunar kettle when you have time, can you test the PR I have opened, #4871 it should fix all Oklab colors in PDF (not the way I'd like...)

lunar kettle
#

will test right away

sturdy sequoia
sly pecan
#

Yes

lunar kettle
#

dw the final thing before going to bed 😂

sturdy sequoia
lunar kettle
#

I was working on my pdf crate 👀 and it's soon ready for the first release I think, hehe

sly pecan
#

For instance this

sturdy sequoia
lunar kettle
sturdy sequoia
#

Well isn't that nice

lunar kettle
#

this should in theory make it possible to encode as cmyk too, right? even if it's a bit lossy

#

for the future I mean

lunar kettle
sturdy sequoia
#

there was a PR to fix it

sturdy sequoia
#

The PR is well titled 😂

sly pecan
#

Did you see the pr I linked above?

sturdy sequoia
sly pecan
#

Speaking of which, I just ordered computer parts, including a 7800x3d

#

Couldn't really bring my computer with me to the us

#

Though I did bring a PS5, XSX, GPU, ram and ssds 😂

sturdy sequoia
#

Using a super simple "best of 10" approach, I get the following results:

commit     cold   t1     t2
c4dd6fa0   1.99s  467ms  115ms
a7c4aae3   2.26s  440ms  117ms
#

So there has been a significant slowdown on masterproef
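
For reference, a "best of N" measurement can be sketched like this. `best_of` is an illustrative helper, not the script that produced the numbers above:

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and returns the fastest wall-clock time.
/// Panics if `runs` is zero.
fn best_of<F: FnMut()>(runs: usize, mut work: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .min()
        .expect("runs must be at least 1")
}
```

Taking the minimum filters out runs disturbed by other processes. On the table above, the cold column goes from 1.99s to 2.26s, roughly 14% slower.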

sly pecan
#

Oh no 😦

sly pecan
#

You had a 7950x3d?

sturdy sequoia
sly pecan
#

Yeah

sturdy sequoia
#

Congrats ❤️

sly pecan
#

Since august 16

#

Thanks

sturdy sequoia
#

and kind of pissing me off

#

4 sticks of RAM OC'ed on this platform just isn't that nice

sly pecan
#

As in no Expo?

sturdy sequoia
#

yes

#

I had managed to make it super stable but at the cost of insane temps on the I/O die by manual tweaking of timings

sly pecan
#

4 sticks of ram is a shitshow on any platform. Honestly they should just ship motherboards with only 2 slots

sly pecan
#

I'm doing an itx build

#

In a fractal terra

sturdy sequoia
#

I used to have a Ryzen 2950x with eight sticks, that thing was rock stable

#

||that gif was annoying me||

sly pecan
#

Didn't wanna have to deal with all the issues with the multi CCD

#

So that's why 7800

sturdy sequoia
#

(on the 2950x)

sturdy sequoia
#

just Windows being windows

sly pecan
sturdy sequoia
#

I like speaking to you

#

but it's almost 1:30 and I am tired

#

😪

#

Have a good one ❤️

sly pecan
#

Nighty night

glossy shore
sturdy sequoia
glossy shore
#

I don't follow

sly pecan
feral imp
#

Sounds more like a call to not do too much performance testing.. But a quickie The Thesis performance stats would be informative..

lunar kettle
#

i'll give it a try on my computer

#
Before:
typst c main.typ --root ../ --font-path ../fonts  7,67s user 1,12s system 323% cpu 2,721 total
typst c main.typ --root ../ --font-path ../fonts  7,68s user 1,20s system 320% cpu 2,775 total

After:
typst c main.typ --root ../ --font-path ../fonts  7,43s user 0,96s system 316% cpu 2,647 total
typst c main.typ --root ../ --font-path ../fonts  7,52s user 1,10s system 309% cpu 2,782 total
typst c main.typ --root ../ --font-path ../fonts  7,45s user 0,99s system 316% cpu 2,668 total
lament fulcrum
sly pecan
#

I'm surprised it's actually faster. But that'll depend on the document I guess

lunar kettle
#

will try again

lament fulcrum
#

should maybe also test a document with lots of show text rules?

glossy shore
#

assuming it even still gives the correct results

glossy shore
#

lol 😆

#

I mean this is still a breaking change

left night
#

*bug fix

#

but I get what you mean now

glossy shore
#

even if the new behaviour would be less surprising any documents written for the old one could have worked around it in an incompatible way

left night
#

I just looked at the thesis before/after and it seems to be exactly the same. But I'm not sure it even uses text show rules.

glossy shore
#

nice

left night
#

But generally, I would not be surprised if some bug sneaked in. It's a complete rewrite after all.

#

Though I did rewrite it very incrementally, constantly running the tests

#

Something like 100 WIP commits that nobody will ever see

glossy shore
#

I wouldn't worry about that anyways, this is a change for the best

#

and Typst is in beta for a reason

left night
#

actually we dropped the whole "beta" wording ^^

#

but it's 0.x of course

#

the beta label was mostly for the web app

slim sequoia
#

From "beta" to "better" 🥳

glossy shore
#

wops

#

I wonder what a 1.0 might entail

slim sequoia
#

Holographic content realization - content appearance depends on what angle you look at it (only IPS panel screens supported)

left night
glossy shore
#

certainly imagined as much, don't be mistaken

sturdy sequoia
lunar kettle
#

eh it varies a lot

#

i think it should be about the same

sturdy sequoia
#

@cunning wadi Easiest 20% gains in masterproef compile time yet: upgrading my RAM so that I can use EXPO 😎

cunning wadi
#

Wait you weren't using EXPO/XMP?

sturdy sequoia
cunning wadi
#

I should check how fast my new computer is at mosterproef™️

sturdy sequoia
sly pecan
#

2*32 or 2*48?

sturdy sequoia
#

64GiB

placid pivot
#

There is 48*2?

#

I'm a bit late but yeah

sturdy sequoia
placid pivot
sturdy sequoia
#

But to be fair it wasn’t their own fault

#

Unlike Intel

sturdy sequoia
#

But it was motherboards not respecting the specs from amd about max voltage

placid pivot
#

Aaaaah I see

#

Like intel no?

sturdy sequoia
#

No, it was purely the motherboards' fault, unlike Intel, whose guidelines led to long-term degradation due to poor design and a need to keep bumping the clock to stay competitive

sly pecan
shy sage
sturdy sequoia
#

Huh I didn’t know

sly pecan
sturdy sequoia
# sly pecan

not really, i'll survive with 64 Gig of ram as long as it's not crashy 😄

sly pecan
#

If that's what you tell yourself to sleep at night!

glossy shore
#

16 is too much ;-;

cunning wadi
#

okay I ran masterproef on main

#

1sec 847ms 406µs 833ns

#

on my new computer

sly pecan
cunning wadi
#

Horrible!

hoary dew
cunning wadi
#

I didn't

#

I mainly wanted to see how fast my new computer can compile

#

I can do so later though

feral imp
#

1 sec thesis is really nice.

#

Main is increasing capabilities, whilst still retaining respectable performance.

sturdy sequoia
#

I made the figures smaller and optimized a few things

sturdy sequoia
#

it should be no more than 100ms incremental even for 164 pages

feral imp
#

🤷‍♂️ I don't know man, performance would be nice, but we are getting a ton of features... And those are a bit necessary. Execution performance isn't necessary... Lower memory usage is though.

It is a balance.

Feel free to make typst fast again

sturdy sequoia
#

performance has gotten so tricky to optimize (multithreading really isn't helping :()

#

I found a few low hanging fruit but I think that's it

feral imp
#

... What about what laurmaedje wrote in the PR... Let me find the link...

sturdy sequoia
#

I wonder if just re-allocating Vec for StyleChain would be more efficient than linked lists

#

I mean linked lists are known to be quite bad for CPU caches and allocations end up being fairly cheap

#

I guess the cloning into the new vec would be the most expensive

proud sandal
#

How many styles chains are there on average and how large are they

sturdy sequoia
#

one for each field set in a set but afaik they're grouped in an array

proud sandal
#

Mhm

sturdy sequoia
#

StyleChain is a linked list of slices

#

But I guess in a large document that can get quite long
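
To illustrate the shape being discussed, a simplified sketch of a linked list of slices. `Style` and `Chain` are stand-ins written for this example, not Typst's real `StyleChain`:

```rust
/// Simplified stand-in for a style entry.
#[derive(Debug, PartialEq)]
struct Style {
    key: &'static str,
    value: i64,
}

/// One link: a slice of styles plus a reference to the outer link.
#[derive(Clone, Copy)]
struct Chain<'a> {
    head: &'a [Style],
    tail: Option<&'a Chain<'a>>,
}

impl<'a> Chain<'a> {
    fn new(head: &'a [Style]) -> Self {
        Chain { head, tail: None }
    }

    /// Adds an inner link without copying any outer styles.
    fn chain<'b>(&'b self, head: &'b [Style]) -> Chain<'b> {
        Chain { head, tail: Some(self) }
    }

    /// Walks from the innermost link outwards; the first match wins.
    fn get(&self, key: &str) -> Option<i64> {
        let mut link = Some(self);
        while let Some(chain) = link {
            if let Some(style) = chain.head.iter().rev().find(|s| s.key == key) {
                return Some(style.value);
            }
            link = chain.tail;
        }
        None
    }
}
```

Extending the chain is O(1) and shares all outer links, while a lookup hops through one pointer per link, which is the cache-unfriendly part. A `Vec`-based variant would flip that trade-off by cloning on extension.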

proud sandal
#

Do they share common nodes?

sturdy sequoia
proud sandal
#

Like if there's two code paths and each one is taken once

#

OK

sturdy sequoia
#

But there @left night is far far far more knowledgeable than me

#

I probably shouldn't be talking without confirmation

glossy shore
#

oh is it like a cons list?

molten kayak
#

I wonder if something like a persistent map would be more efficient. It would retain the sharing of the style chain for the nodes up the chain while being faster to index into.

left night
left night
sturdy sequoia
#

even with just two threads (the minimum) it's very difficult to interpret

left night
#

I think we could make it not use a worker if threads=1

#

that would make it easier to optimize single-threaded performance
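
A sketch of what such a fast path could look like, using std scoped threads for the parallel case. `layout_pages` and its signature are invented for the example; Typst's real engine is structured differently:

```rust
use std::thread;

/// Applies `layout` to every page, spawning workers only when `threads > 1`.
fn layout_pages(pages: &[u32], threads: usize, layout: fn(u32) -> u32) -> Vec<u32> {
    if threads <= 1 || pages.len() <= 1 {
        // Stay on the calling thread: no worker, no synchronization,
        // and a single call stack for the profiler to show.
        return pages.iter().map(|&p| layout(p)).collect();
    }
    let chunk = pages.len().div_ceil(threads);
    thread::scope(|s| {
        // Spawn one worker per chunk, then join them in order.
        let handles: Vec<_> = pages
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|&p| layout(p)).collect::<Vec<_>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

The same early return could also cover documents with few pages, where the cost of syncing may outweigh the parallel speedup.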

sturdy sequoia
#

I think that would be better indeed, would make profiling with say VTune a lot easier

left night
#

Feel free to give that a shot

proud sandal
sturdy sequoia
sturdy sequoia
sly pecan
#

@sturdy sequoia you know the drill

#

🤣

feral imp
#

In my very unscientific testing on a few documents, performance was not affected much. While working on the PR, I primarily focused on getting things right and not hurting performance too much, so that's a good result I would say. Fundamentally, doing relayouts is of course more expensive than not doing them.
Source: https://github.com/typst/typst/pull/5017

sly pecan
#

Yeah but he is the performance guru

sturdy sequoia
feral imp
sly pecan
sturdy sequoia
#

😮

sly pecan
sturdy sequoia
#

balanced as all things should be

sly pecan
molten kayak
molten kayak
# feral imp Polylux? Png?

Polylux initially, then switched to touying. To be fair though there are a bunch of fletcher diagrams and I reached 24GB while editing them (just compiling when using touying is much more manageable with 1-2GB now). The presentation is like 70 pdf pages, it would be 17 slides but with a bunch of pauses in slides with diagrams.

feral imp
#

Ok.

left night
#

many blocks were cached before (at least I'm fairly sure they were), but some weren't, in particular equations

sturdy sequoia
#

Coldly *

sturdy sequoia
left night
#

I have a math-heavy document I got from someone and incremental is 6x faster with this PR

sturdy sequoia
left night
#

on main, actually no single/multi blocks were cached, only user blocks

#

on 0.11.1, only LayoutMultiple was cached, not LayoutSingle (at least I think so, I didn't 100% validate that)

sturdy sequoia
#

Good news is that it’s now cached!

#

I have a math heavy doc of mine but it’s on the webapp so I wonder if it will benefit from it

glad urchin
#

I'll definitely be testing that on some of my own documents too

#

I do have a few with some math which eventually got slow in incremental , purely anecdotal though

#

And it wasn't that much to bother me, was just a bit more than usual :p

#

Thought it was normal for the size of the doc, but maybe it improves anyway 👀

glad urchin
#

so

#

while answering a forum post, i found one small document that got 7x slower on 0.12.0-rc1 for some reason, which is interesting

#

(from https://forum.typst.app/t/how-to-create-a-table-with-round-corners-like-in-rect-function/1051/)

#table(
  fill: (x, y) =>
    if x == 0 or x == 6 or x == 7 or y == 1 { silver },
  columns: (0.45cm, 0.45cm, 0.45cm, 0.45cm, 0.45cm, 0.45cm, 0.45cm, 0.45cm),
  inset: 2pt,
  stroke: 0.2pt,
  align: center,
  table.header(
    table.cell(colspan: 8, fill: silver)[*Dezember*],
    [*KW*], [*Mo*], [*Di*], [*Mi*], [*Do*], [*Fr*], [*Sa*], [*So*],
  ),
  [48], [], [], [], [], [1], [2], [3],
  [49], [4], [5], [6], [7], [8], [9], [10],
  [50], [11], [12], [13], [14], [15], [16], [17],
  [51], [18], [19], [20], [21], [22], [23], [24],
  [52], [25], [26], [27], [28], [29], [30], [31],
)
#

0.11.1 -> 5 ms, 0.12.0-rc1 -> 35 ms

#

very odd

#

timings: (0.11.1)

#

0.12.0-rc1

#

seems like pad is more expensive now

#

okay i can confirm this happens even for this MWE

#table(
  columns: 2,
  [a], [b], [c], [d]
)
#

3 ms -> 10 ms (~3.3x slower)

#

big sad

glad urchin
#

okay false alarm, i blundered, sorry folks lol

#

testing environment was exceedingly unscientific

#

so we can ignore that

#

after deleting everything and starting from scratch it works fine so yeah

#

my fault 😄

untold turret
glad urchin
#

would be cool to have something more automated yeah

#

i would have probably used such a thing before posting anything here

#

haha

#

Basically the person in the forum thread reported a 10x slowdown by wrapping tables in blocks, so i tried to reproduce and noticed there was a slowdown even without blocks
But turns out i was just using some bad binary , maybe built without optimization or smth as it was in some random testing folder
Using the binary from gh releases fixed it (for the case without blocks)

#

With blocks there is a slowdown on both Typst versions , but nowhere near 10x for me

untold turret
#

Beyond that, it would be great to have something like typst-test bench that provides machine-independent statistics, then we can make such a repo with the tool. We can simply run the tools locally.

glad urchin
#

More like 1.5x or smth

sturdy sequoia
#

But it's tricky

sturdy sequoia
glad urchin
violet axle
#

For our practical university course, the time has also gone down from ~10s to ~3s (on a rather old pc). So thank you for the improvements!

unborn geyser
left night
#

@sturdy sequoia some interesting stuff I dug up about the history of pre-interned strings in rustc:

sturdy sequoia
#

Or I could just spend the two hours to make it by hand lol

#

replacing the pico macro with (essentially) a static list of strings it matches over

left night
#

maybe we should first discuss what kind of strings we'd like to have statically interned in the first place

#

right now PicoStr is used exclusively for Labels

#

I think it would be nice to use it for all fields (possibly replacing the u8 field ids?)

#

I also would like to use it for HTML tags & attributes in the future

#

and I'd like to have the tags available in const contexts

sturdy sequoia
#

I know that using it for fields and function args actually makes a decent perf bump even with the current impl

#

(since I had done it for the VM notably)

left night
#

I think variables might also want to use it

#

i.e. anything in the LHS of a Scope

sturdy sequoia
#

Huh it's funny because I hadn't tested it, since the VM used IDs for variables

left night
#

we probably checked before, but are rust static strings guaranteed to be dedup-ed?

sturdy sequoia
#

maybe each scope could have its own list of IDs for variables (so as to not pollute the global interner)

sturdy sequoia
left night
#

so we can't rely on pointer comparison for string literals?

sturdy sequoia
#

I am not sure

left night
#

also unfortunate that strings have alignment of 1

#

kinda prevents using padding bits for tricks

sturdy sequoia
#

This is the best source I can find

left night
#

but I guess distinguishing between static strings and runtime ones doesn't work anyway

sturdy sequoia
left night
#

we can't know if a runtime interned string also exists in the static strings

#

so we can't compare properly
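
One possible way around that, sketched under the assumption that every runtime string goes through the same lookup table: seed the interner with the static list first, so a runtime string equal to a static one resolves to the static id. `STATIC`, `Id`, `Interner` and `FILL` are made-up names, and this is not how `PicoStr` is implemented:

```rust
use std::collections::HashMap;

/// Hypothetical pre-interned names; their position doubles as their id.
const STATIC: &[&str] = &["body", "fill", "stroke"];

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Id(u32);

/// Usable in const contexts because `STATIC` is seeded in order.
const FILL: Id = Id(1);

struct Interner {
    ids: HashMap<String, Id>,
    strings: Vec<String>,
}

impl Interner {
    /// Creates an interner that already contains every static string.
    fn new() -> Self {
        let mut this = Interner { ids: HashMap::new(), strings: Vec::new() };
        for s in STATIC {
            this.intern(s);
        }
        this
    }

    /// Returns the existing id for `s`, or assigns the next free one.
    fn intern(&mut self, s: &str) -> Id {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = Id(self.strings.len() as u32);
        self.ids.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        id
    }

    /// Maps an id back to its string.
    fn resolve(&self, id: Id) -> &str {
        &self.strings[id.0 as usize]
    }
}
```

Comparison then happens on ids only, at the price of one hash lookup whenever a runtime string is interned.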