#Performance
it's in a layout_grid in the last iteration
very interesting: since locate() provides context, I removed all uses of loc and replaced them with here() etc. and it's still fast
but flipping the switch and replacing locate with context makes it slow
yeah I know, I tried it too!
It's like something in the eval of an ast::Context is slow maybe?
but that doesn't make sense
it might be that the functions produced by context are harder to disambiguate for the locator or something like that
Interestingly, with the VM, both techniques take exactly the same time (which I would expect tbh)
yeah, probably you didn't port the minor difference that creates the problem
Maybe it's the CaptureVisitor?
it has a separate capturer
no, doesn't look like that would explain it
Ok, I found an awesome heuristic for whether to enable caching with the VM or not: the number of registers it uses
works surprisingly well
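A minimal sketch of what such a heuristic could look like. The `CompiledClosure` type, its `registers` field and the threshold of 16 are hypothetical and not taken from the VM branch; the only idea from the chat is that the register count acts as a cheap proxy for how much work a function does.

```rust
/// Hypothetical stand-in for a closure compiled by the VM.
pub struct CompiledClosure {
    /// Number of VM registers the compiled body uses.
    pub registers: usize,
}

/// Assumed cut-off; a real value would have to be tuned by benchmarking.
pub const MEMOIZE_MIN_REGISTERS: usize = 16;

/// Decide whether calls to this closure should go through the memoization cache.
/// Small bodies (few registers) are assumed to be cheaper to re-run than to hash.
pub fn should_memoize(closure: &CompiledClosure) -> bool {
    closure.registers >= MEMOIZE_MIN_REGISTERS
}
```

The check itself costs a single comparison, so it adds nothing measurable to a call.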
@left night I just had a realization: the VM isn't slower in incremental, it looks like multithreading has slowed down incremental even on main 
Maybe the VM isn't that bad after all
What a twist.
maybe we should be clever about when we do multithreading: if the page count is low we can maybe do it single threaded to avoid the costs of syncing?
indeed
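A rough sketch of that idea using only the standard library. `PARALLEL_MIN_RUNS` and the `layout_run` callback are assumptions for illustration; Typst's actual layout code is not shown in the chat and is certainly structured differently.

```rust
use std::thread;

/// Assumed cut-off below which spawning threads is not worth the overhead.
const PARALLEL_MIN_RUNS: usize = 4;

/// Lay out each page run, in parallel only when there are enough of them.
/// `layout_run` is a placeholder for the real per-run work.
pub fn layout_all<F>(runs: &[u32], layout_run: F) -> Vec<u64>
where
    F: Fn(u32) -> u64 + Sync,
{
    if runs.len() < PARALLEL_MIN_RUNS {
        // Few runs: stay on the current thread and skip synchronization costs.
        return runs.iter().map(|&r| layout_run(r)).collect();
    }
    let layout_run = &layout_run;
    // Many runs: one scoped thread per run, results collected in input order.
    thread::scope(|s| {
        let handles: Vec<_> = runs
            .iter()
            .map(|&r| s.spawn(move || layout_run(r)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("layout thread panicked"))
            .collect()
    })
}
```

Both paths return results in the same order, so the caller cannot tell which one ran.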
@feral imp Mind you, it's not perfect and there are definitely areas of improvement
compile times are still too slow imo
Back to work then :p
Wouldn't it make more sense to look back at a vm once other things have been optimized and stabilized first?
Sunk cost fallacy is real
what things? 
I'm just trying to save you from sinking more time into the vm only to realize it's not worth it again 😂
oh don't worry I won't
but I feel like there's a gotcha somewhere
I mean it used to be 5x faster and now it's only 2x faster
so clearly I made something slooooooooooooooooooooooow
what difference have you observed?
on main, from right before mt it was ~300ms for my standard test, and went to ~500ms afterwards (likely due to overhead of mt), and the VM takes the same ~500ms incremental due to re-compiling
mt = multithreading but I am too lazy to write it out
Memory bandwidth?
no, most likely synchronization, lemme test with a -j 8 if it's lower then I am right
Well no, this ain't it chief

conceptually, multi-threading shouldn't really have overhead in this case though
only page runs are multi-threaded, so if just one page run changes -> no multithreading
it's still slower mind you
yes but if you look at the --timings it does show you that it's doing... something:
not on my machine. a68a24157 is ~340ms incremental, 7fa86eed0 is ~326ms incremental for an edit in some paragraph in 4_phos.typ.
weird
I would need to do a bisect on a longer period then
I'll try and do that once I'm done working
'cause it is slower than it used to be
(or I am going insane)
why bisect? did you test main vs before or immediately after multi-threading vs before?
It may be, but I would be surprised since the workload is really the same
There are a few threads spawned that do basically no work (fully memoized) and one thread does the work of the changed page run
I was on some commit before, and right after
but I don't know which commit before
so I need to do a proper bisect between v0.10.0 and main
which won't be fun 😭
it's true that AMD CPUs struggle with memory latency (although not bandwidth) whereas the Apple Silicon M1 had unified memory
and I am on Windows
which is yucky
Almost 50% of the total time on my thesis is eval
🧐
(this is the wonder of the Intel VTune oneapi)
And guess where most of that is spent
||you guessed it, it goes in the hashing hole||
The fact that actually running code takes 23ms is worrying frankly (out of 4s)
Is this actually true? When I looked at timings from the VM recently, time spent on compile vs eval was relatively balanced. It was just the final eval that was fast because most eval was in imports and thus a child of a compile_module (or in context)
yes it's true, actually running opcode takes 23ms, it's hashing that takes all of the rest according to a VTune run I did where I added "tasks" which allows tracking more precisely where time is spent
Mind you, there are other calls it does that take longer, but overall, hashing accounts for the vast majority (~80%) of the time spent, measured using hardware-based counters, so I would expect it to be somewhat accurate
What does typst use for hashing now?
SipHasher 1-3 iirc
or 1-4
I would argue there's not much leeway in changing that 😐
It's mostly malloc and caching stuff
either Hasher::write or stuff in comemo
clearly we can do better
128 bit right iirc?
yep
If we could reduce it to 64-bit maybe it would be faster but I don't know how much, and I am not sure it would be worth it even
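For a sense of what the output width buys, here is a small sketch of the birthday bound for an ideal hash. It says nothing about speed or about the quality of any particular hash function; it only estimates how likely a collision is among `n` distinct inputs.

```rust
/// Approximate probability that at least two of `n` distinct inputs collide
/// under an ideal `bits`-bit hash (birthday bound):
/// p = 1 - exp(-n * (n - 1) / 2^(bits + 1)).
pub fn collision_probability(n: f64, bits: i32) -> f64 {
    let exponent = n * (n - 1.0) / 2f64.powi(bits + 1);
    // exp_m1 keeps precision when the probability is tiny.
    -(-exponent).exp_m1()
}
```

With about four billion hashed inputs (2^32), a 64-bit output already gives a collision probability of roughly 39%, while a 128-bit output stays far below anything observable.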
There's https://github.com/Cyan4973/xxHash but it's a c library
I don't think that would be a good idea
The problem is also quality, we need a very high quality hash
although perhaps having 128-bits decreases the need for quality somewhat
there's also gxhash, but I tested it and saw no performance improvements
although it has had a lot of commits since then so perhaps it has improved
That one is apparently unsound
The quality section does indicate it's fairly good. Though I have no idea about theoretical guarantees
I should also mention: they still don't have a fallback version of the algorithm
meaning it only works on ARM & x64
See https://www.reddit.com/r/rust/comments/1d5yq2l/gxhash_an_extremely_fast_hardwareaccelerated/l6pcd02?context=3 @sturdy sequoia
No wasm you mean?
yep
I thought wasm had simd
I did open an issue early on
I thought you meant xxhash
Did you see the reddit comment I linked? I think gxhash is a non-starter
it is for multiple reasons 😂
https://crates.io/crates/xxhash-rust @sturdy sequoia
I'll give it a whirl 😉
I have no idea what i'm talking about, fyi
But it seems promising
I'm testing it, we'll know in a couple of minutes
@sly pecan it's slower, by quite a margin in fact
likely because it handles smaller inputs less well than SipHash 1-3
for the thesis: it's 0.2s slower on average compared to the VM, for the raytracer: it's 4s slower so 19s compared to 15s with SipHash 1-3
at least on my hardware *
Is this with the simd features enabled?
I mean there aren't many feature flags?
I enabled xxh3 for 128-bit support and that's it
and doing the whole -C target_cpu=native is cheating imo
since we can't distribute that
I thought the idea was that small stuff wouldn't be memoized?
yes but the individual inputs (i.e pieces of data) that we pass to sh13 are smol
Ok
I think that a good way of determining whether we need memoization would be to have a .len() method on Value that gives us an idea of the size of the input, that way if we have large arrays then we can check those, if we don't then we skip memoization, etc.
maybe more something like .size-hint()
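A sketch of how such a size hint might look. The `Value` enum here is a heavily simplified stand-in, and `MEMOIZE_MIN_BYTES` is an assumed threshold; neither reflects Typst's real types.

```rust
/// Simplified stand-in for Typst's `Value`; the real type has many more variants.
pub enum Value {
    Int(i64),
    Str(String),
    Array(Vec<Value>),
}

impl Value {
    /// Rough number of bytes that hashing this value would have to process.
    pub fn size_hint(&self) -> usize {
        match self {
            Value::Int(_) => std::mem::size_of::<i64>(),
            Value::Str(s) => s.len(),
            Value::Array(items) => items.iter().map(Value::size_hint).sum(),
        }
    }
}

/// Assumed threshold in bytes; it would need tuning against real documents.
pub const MEMOIZE_MIN_BYTES: usize = 1024;

/// Memoize only when the arguments are large enough to make a cache hit pay off.
pub fn worth_memoizing(args: &[Value]) -> bool {
    args.iter().map(Value::size_hint).sum::<usize>() >= MEMOIZE_MIN_BYTES
}
```

Note that this naive version walks nested arrays, so the hint is itself linear in the number of elements; a real implementation would probably cap the depth or cache the result.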
What is the talk about one shot vs streaming?
In the documentation
ah, it does look like it uses different algos then
using the Xxh3 struct will make it streaming, using the xxh3_128() function will make it one-shot
no idea if that has an impact
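To illustrate the API distinction only, here is a sketch with FNV-1a swapped in because it fits in a few lines. This is not xxh3, and whether xxhash-rust's one-shot and streaming paths differ in performance is not something this example can show.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Streaming API: input arrives in pieces and state is carried between calls.
pub struct Fnv1a {
    state: u64,
}

impl Fnv1a {
    pub fn new() -> Self {
        Self { state: FNV_OFFSET }
    }

    /// Feed another chunk of bytes into the running hash.
    pub fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.state ^= u64::from(b);
            self.state = self.state.wrapping_mul(FNV_PRIME);
        }
    }

    pub fn digest(&self) -> u64 {
        self.state
    }
}

/// One-shot API: the whole input is available up front.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hasher = Fnv1a::new();
    hasher.update(bytes);
    hasher.digest()
}
```

A derived `Hash` implementation calls the hasher once per field, which is the streaming pattern with many small writes described above.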
Well more like x86-64-v3 in order to enable avx etc
yes but native will select based on the hardware of the user 😉
To be fair, we could have a "launcher" that selects a typst version with or w/o AVX for better performance
I tested, using target_cpu makes zero difference
🤷♂️
A new much faster hasher has just been integrated into Rust. Maybe this can benefit Typst. Let me try to find a reference to that information.
No, sorry, the last changes were to the sorting algorithms, not hashing.
Yes indeed, mind you we do perform some sorting and the new algorithm will probably perform slightly better
They did switch to a different hash function (from fxhash to a wyhash-inspired one)
This isn't really an option for Typst though, because wyhash (not sure about their modification) only has 62 bits of collision resistance
How big a contribution does shaping have towards compilation? Are we talking minuscule?
yeah probably
Depends on how text heavy the document is. If it's mostly text, it is a fair amount. If you have a lot of other stuff, not that much anymore.
Optimized linebreaks can really suck up performance
Big performance improvements for that are coming soon :)
❤️
Just had an idea recently on how to optimize the optimization
And it's already mostly implemented, just needs polish
I wouldn't say minuscule, but much lower than hashing, image encoding, image decoding
LET HIM COOK
it's optimizations all the way down
Latest rustc-hash from 3 weeks ago also has interesting speed improvements. Not sure though if these are suitable for typst https://github.com/rust-lang/rustc-hash/blob/master/CHANGELOG.md
It looks like it's also 64bit. Probably the same wyhash one used for the sorting improvements
yep that's the one I meant
a quick ctrl-f in the design documents of the new sort algorithm tells me that there is no hasher included
EDIT: no hash found in source code either
and it would honestly surprise me if it was because the sort function only takes Ord
(if anyone's curious, here's the PR: https://github.com/rust-lang/rust/pull/124032)
No, it probably uses the rustc-hash one, which is already an existing dependency.
It's a pity we can't find easy wins 🥹
How do you tell?
The Fxhash implementation came from that crate as that's the one used in the compiler.
I find it amazing that Typst needs higher collision resistance than the compiler?
comemo needs perfect hashing as when there is a collision, typst may not (re)compute something and cause a wrong compilation.
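A toy illustration of that failure mode. The cache below stores only the hash of its input, and the hash is deliberately weak so that a collision is easy to trigger; it is not how comemo is implemented, just the same class of problem in miniature.

```rust
use std::collections::HashMap;

/// Deliberately weak hash (sum of bytes) so that collisions are easy to trigger.
fn weak_hash(input: &str) -> u64 {
    input.bytes().map(u64::from).sum()
}

/// Cache that stores only the hash of the input, never the input itself,
/// so it cannot detect when two different inputs share a hash.
#[derive(Default)]
pub struct HashOnlyCache {
    results: HashMap<u64, String>,
}

impl HashOnlyCache {
    /// Return the cached result for `input`, computing it on a miss.
    pub fn get_or_compute(&mut self, input: &str) -> String {
        self.results
            .entry(weak_hash(input))
            .or_insert_with(|| input.to_uppercase())
            .clone()
    }
}
```

Calling `get_or_compute("ab")` and then `get_or_compute("ba")` returns `"AB"` both times, because the second call hits the entry created by the first. A regular `HashMap` keyed by the input itself would compare keys with `Eq` and avoid this.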
Yeah but weren't you talking about sorting?
Yes, I first misremembered the sorting changes as related to hashing improvements, but those are separate improvements. Sorry for the confusion!
FYI the Rust compiler uses multiple hashes. For normal HashMaps (which do verify whether a collision happened or not) they use rustc-hash. For on-disk incremental cache they only store the hash (thus risking collisions, like comemo) and for that they use SipHash with a 128 bit output.
I just had an idea. If hash slow, what if typst no hash typst eq?
There are a few different types of objects I'd say: the Copy ones which are usually O(1) comparable, the Rc ones (any reference counted objects: Arc, Eco*, ...) which are hashable, and the comemo tracked objects. I think the main issue is the second type (most typst values).
Consider PicoStr. These are interned strings managed by a specialized interner. Can comemo be the interner for Rc-type objects? When a function is called, comemo hashes the arguments, then looks up the result, and returns it. In a sense, comemo "manages" the return value (it won't be dropped unless comemo decides to evict the cached call). What if comemo applied the same interning as with PicoStr, but for all Rc-type hashable returned objects? Actually, what if comemo managed input objects too?
Consider a CRc type (comemo reference counted) which is Arc but with one extra atomic bit indicating whether it is a canonical instance of the object. When calling a memoized function, comemo would check whether the bit of the CRc argument is set and, if it is not, find the canonical instance of the object by hashing it and looking it up. It would then use only the memory location of that canonical instance instead of the hash. Then it would return the same canonical instance of the return value, if applicable.
The reason why this optimization might be good (if possible), is because now arguments are perfectly disambiguated by their memory locations. You no longer need high quality hash functions for interning (because Eq-ing objects one time per each instancing is acceptable), and you no longer need high quality hash functions for looking up the cache too (because all arguments are basically numbers now, and you can Eq arguments too). And it's not like you do much more work anyway: you hash everything before interning, but comemo would do that anyway. Although it's hard to estimate the memory usage.
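A minimal sketch of the interning half of this proposal, using plain `Arc` and standard collections. `Interner` and `PtrMemo` are hypothetical names; the "canonical bit" of the proposed CRc type is not modelled, and nothing here reflects comemo's actual internals.

```rust
use std::collections::{HashMap, HashSet};
use std::hash::Hash;
use std::sync::Arc;

/// Interner that hands out one canonical `Arc` per distinct value.
pub struct Interner<T: Eq + Hash> {
    canonical: HashSet<Arc<T>>,
}

impl<T: Eq + Hash> Interner<T> {
    pub fn new() -> Self {
        Self { canonical: HashSet::new() }
    }

    /// Hash and compare `value` once, then return the canonical instance.
    pub fn intern(&mut self, value: T) -> Arc<T> {
        if let Some(existing) = self.canonical.get(&value) {
            return Arc::clone(existing);
        }
        let arc = Arc::new(value);
        self.canonical.insert(Arc::clone(&arc));
        arc
    }
}

/// Memo table keyed by the address of the canonical argument instead of its hash.
pub struct PtrMemo<T, R> {
    // Holding the Arc keeps the allocation alive, so its address cannot be reused.
    entries: HashMap<usize, (Arc<T>, R)>,
}

impl<T, R: Clone> PtrMemo<T, R> {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Run `f` on the argument unless a result for this exact instance is cached.
    pub fn call(&mut self, arg: &Arc<T>, f: impl FnOnce(&T) -> R) -> R {
        let key = Arc::as_ptr(arg) as usize;
        self.entries
            .entry(key)
            .or_insert_with(|| (Arc::clone(arg), f(&**arg)))
            .1
            .clone()
    }
}
```

Two equal values interned separately end up as the same `Arc`, so the second memoized call is a pointer lookup. The memory concern raised above shows up here too: both tables keep every interned value alive until something evicts it.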
That's good. Further optimization would presumably also make performance headroom for more microtypography.
Could the choice of whether to hash or not be based on size?
I'm too stupid to understand
I think it can be, but it's probably extra work with little to no benefit. If the total size of the object is 1 kilobyte, that 1 kilobyte must be crunched through eq-check either way: be it hash or eq. And that size cannot be determined at compile time, because most objects are built from collections (which have dynamic size, obviously). This leaves us with simple structs, which are usually Copy. That's why I kind of distinguish only between Copy and Rc-types, and not based on size.
I was thinking of memory usage.
In this case I am confused, I don't see how it will help memory usage 🤔
Aren't the hashes much smaller?
I should stop talking 😂
Oh, I get it now. Yeah, it might be a good idea, but a small counter argument is: if you have a big object, you probably haven't built it yourself. It was probably returned by a function, and this function is memoized anyway. So it probably won't help much, but one can never know before one measures so 🤷♂️
Wow, that's significant. Do you have any examples of performance deltas in real-world documents?
@sturdy sequoia The Thesis
The thesis isn't really bounded by optimizing paragraphs
I tested it with an earlier version and it was a few percents I think
I didn't think so, but it should have some effect
That's still a nice gain
I think it will be most impactful with book projects
No cetz, no crazy stuff, just tons of paragraphs.
Is it possible to use different strategies depending on paragraph length?
You mean a different way to bound the paragraph?
yeah
maybe. could also be that there's a smarter way to apply the bound. that's just what I came up with.
I guess there are diminishing returns. Few paragraphs have thousands of words anyway
yeah, I'm quite happy that it's this way around, rather than the bound only kicking in late
I also enjoy a lot that something that feels like algorithms class actually brings a lot of real-world performance here.
I did a course on bounds for graph vertex cover optimization in university, which was somewhat similar.
roughly 2.85s -> 2.65s
"Unfortunately, this approach of computing widths falls down with proper text shaping"
What does luahbtex do when using harfbuzz?
hey, that's pretty good
I wondered that too but haven't checked yet!
In my testing there is pretty much no difference on The Thesis™️
But that's okay
https://github.com/latex3/luaotfload/issues/152 discussion here may possibly be relevant
sounds like latex may cheat a bit?
would probably be easiest just to ask the latex and/or harfbuzz devs
Another consideration is microtypographical features, which will reduce the need for hyphenation
That "oi wiki" document would probably show a delta.
@untold turret
My heart tells me we should ping @onyx furnace right? You're the arbiter of the big-wiki-file, right?
yes. but i cannot find the zip file now😂 i can regenerate one if it's needed
it's mostly bottlenecked by plugins
😭
The version I have is too old to even compile on main @onyx furnace 😂
well, it's for once an optimization that's not tuned for the thesis
We have to make laurmadje feel better about his PR.

But but but.....
I feel good since it's more than 8x for my test document.
that's pretty dang impressive
are there many docs that are bottlenecked by paragraph layout?
I would wager to guess that most documents that aren't bottlenecked by something else and use justification are bottlenecked by this.
Most prominently of course books
And test benchmarks of people coming from LaTeX ^^
I wonder what the thesis is bottlenecked by? 
'cause I have no idea at this point 😄
by images & codly
I guess I shouldn't be surprised 😄
I would've thought images were pretty fast
nope
I mean the multithreading helps
but on each thread it's still fairly slow
not much we can do there I am afraid
png?
yes
Are they high resolution?
decently yeah
I would love if we had impeccable SVG support
it would be so good
No one needs more than 640x480
and no one will ever need more than 256MB of RAM and a 20GB Hard drive
It's mostly there isn't it?
Are we not impeccable yet 😠😂😂
I think it's more of a draw.io issue at this point 
indeed 😦
they're like "we only want to support the web browser"
Like just add an option "inline everything" in your goddamn software ffs
@sly pecan I love how they marked your one comment in a draw io issue as a duplicate 😂
Care to send it? 😄
Wait what
Anyway off topic
I wish we had a draw.io like tool that generates Typst code
(side project anyone?)
general layout stuff I guess
@sturdy sequoia btw does vtune work with programs that only run for a few ms?
all the flows and pads and grids
no, use callgrind for those
you will get barely any samples at that point
yeah, codly and images too
but maybe actually no. most of those layouts end up in style or context
There is one page that takes 450ms per iteration all on its own and it's a big table 😄
in some cases, it's possible to reuse the encoded PNG or JPEG data. but it's a bit of work since there is no crate yet that gives such low-level access to the image data. not hard to write, but needs some time.
and probably somewhat error prone *
right?
not necessarily
if you want to be sure, you'd still need to decode the image, but not encode
Ah. Yes, you should feel good about this PR then.
yes, it does sound like it
XeTeX splits text at hyphenation points and then shapes only the parts between the hyphenation points. This ignores ligatures/kerning across discretionaries. Then these widths are used for linebreaking. After linebreaking, the horizontal lists are reshaped, this time taking kerning/ligatures across discretionaries into account.
so, it's sort of like if we used the approximate result directly instead of as a bound
which would speed up things much more. but it's a hack.
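One way to picture the approximate widths: shape each segment between possible break points once, then estimate any candidate line by summing segment widths. The sketch below uses prefix sums so that each estimate is O(1). `ApproxWidths` is a hypothetical helper with plain numbers standing in for shaper output; the chat does not say whether the Typst PR or XeTeX store their widths this way.

```rust
/// Cumulative widths of independently shaped segments (the text between
/// possible break points). The widths would come from a shaper; here they
/// are plain numbers supplied by the caller.
pub struct ApproxWidths {
    prefix: Vec<f64>,
}

impl ApproxWidths {
    pub fn new(segment_widths: &[f64]) -> Self {
        let mut prefix = Vec::with_capacity(segment_widths.len() + 1);
        let mut total = 0.0;
        prefix.push(total);
        for w in segment_widths {
            total += w;
            prefix.push(total);
        }
        Self { prefix }
    }

    /// Estimated width of a line made of segments `start..end`.
    /// Ignores kerning and ligatures across segment boundaries, so the
    /// result can differ slightly from a full reshape of the line.
    pub fn estimate(&self, start: usize, end: usize) -> f64 {
        self.prefix[end] - self.prefix[start]
    }
}
```

Using the estimate directly as the line width is the shortcut described for XeTeX; using it only to rule out candidates before an exact reshape keeps the final result correct.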
I feel like we can do it as you've done, no need for a hack that might result in incorrect linebreaking imo
agreed
The alternative would be more granular control than simple and optimized I guess
the thing is that the approximate layout can result in overfull lines
which, from my reading, XeTeX would also suffer from
How overfull are we talking? Many sins can be hidden by adjusting spacing
typically not much, but fonts can do random shit
Just blame it on the font if that happens 😈
that's the LaTeX way :p we don't have to since we do it properly.
I mean we do things properly while being heaps faster
so you know... we're right 😎
at least a lot faster than LuaLaTeX
pdfLaTeX is a matter of how you view it:
I think it's still faster for a single plain compilation with just a lot of text (not 100% sure), but you need to run it multiple times for outline etc. so I guess a single run is more akin to a typst watch cycle
What's the deal with this jump? The slope seems to change after too
The post suggested memory bottlenecking, perhaps that's where the bottleneck becomes apparent in the new algorithm
Oh you mean that's where l1/l2 cache is full?
I'm not sufficiently technologically literate to know what that means or if it's the case
I'm a chemist simpleton 😛
Upon further reflection, that doesn't make sense either, because the y axis isn't time, it's the number of lines built
It's just strange to me that there's such a distinct jump
it's probably a very specific line being approximated badly leading to a less tight bound
I assume that this particular bump is specific to lorem ipsum with default settings
That makes sense
most likely L3
L1/L2 have basically the same latency ||gross oversimplification|| most likely when it can't rely on L3 anymore and has to go to main mem
Memory isn't the culprit here, since the y axis isn't time
no but more memory means more latency means more slower
Well yes
But it's not relevant here
hmm maybe
The y axis is just the number of lines
I made the same mistake 😀
rip in pepperoni to the both of us
@sturdy sequoia does perfetto also always initialize fully zoomed in for you since a while? I think originally it showed the full trace by default but now I need to drag the slicer all the way to the right every time. I can't figure out what the cause is.
@left night yes, it's really gosh darn annoying
they must've shipped some bug
cause I deleted all local data
maybe this? https://github.com/google/perfetto/issues/810
yes I think so
@left night interestingly, on my thesis, I resized (using a script 😅) all of the images and it had zero impact on compile times! (probably thanks to parallel compilation)
But, removing the glossary calls in my code blocks saves almost 0.4s of compile time (1.8s -> 1.4s cold)
Also makes incremental quite a bit faster 🎉
Although it does save almost 1GB of RAM when compiling lol
time to write a more efficient glossary
Yes but this all makes me think that:
- There should be a basic "image" resizer in the webapp for convenience when uploading like "oh, you uploaded an image and it's quite large, wanna resize it?"
- We should have a quick guide on how to make fast docs
- And yes, we need a more efficient built-in glossary 😄
BTW @left night could I ask you (and it's likely a big ask) but to compare the VM and main in the webapp? (since obviously I can't do that), I am curious if the VM helps in WASM at all or not 🙂
The big advantage of the VM (in theory) is also that it could be JIT'ed in WASM
I'll try that on a weekend sometime.
Thanks 😄
@sturdy sequoia did the new thing affect The Thesis™️?
Which one?
I shall try it then
wouldn't also mind knowing how it looks different if at all
If it's just a refactor it shouldn't change at all
Apart from bug fixes
in which case dherse whatever you do please don't tell me if it looks different at all
must've worsened the performance by quite a bit then
Y'all know I work like... a difficult job that you know, requires most of my brain power 😂
it's still compiling
I just got on my PC :-p
And then I was stuck in traffic for an hour
because even the f-ing E40 was clogged for some f-ing reason that blows my mind
I was joking. Take your time ❤️
Plus I thought I would get fired yesterday, which obviously didn't happen and I was wayyy over reacting
but damn he drained me
Whaaaaa
Yeah, dumb company was mad that I used a colleague's license for their software to try if it was the right fit for my work
they got into a bit of a hissy fit while we were on call (just me and them)
had to calm them down
and immediately talked to my management who said that they were a pain in the neck and that I reacted correctly
I was shooketh for sure
I wish, I meant "it"
😭
The gods hath spoken
almost done, just waiting on main to compile release
I also optimized The Thesis™️ itself
it compiles in barely 1s now!
But but
Before or after recent commits?
but it can vroom more
:3
Someone needs to keep track of compile times in a chart
No need for incremental compilation anymore
X axis release, y axis inverse of compile time
Stonks
@left night in shambles right now 😂
To be fair, incremental is sloooooooooooooooooooooooooow now
it takes 0.4s per iteration
which is slow compared to cold compile
I blame a lack of VM
It's gotten slower? Or just relative to cold
I'll do a custom build without comemo in a few minutes 😂
it's gotten slower at some point
but don't know which commit
Oh noes
sorry we had to intentionally slow down your thesis to ensure we can still run benchmarks on it
Note: PDF export on Windows native
so to me it does look like decent gains """for free"""
Wait no
I ran the VM runs with timings
DON'T LOOK AT THE RESULTS
branch      cold    inc
main pre    1.63s   401ms
main post   1.46s   406ms
vm pre      1.45s   400ms
vm post     1.36s   410ms
The VM is so f-ing useless on The Thesis™️ lol
Ok, There was only one bad result, it's fixed now 😉
note that "main pre" is a bit older than just before this commit afaik so the difference should be a bit lower
0.1s isn't bad
yeah I agree
Weird that incremental is comparatively slow
yeah incremental has become crazy slow at some point
but I don't know when
For some reason this citation is by far the slowest
taking 25 ms all on its own lol
I guess incremental performance is harder to figure out
Have you tried the speculative execution branch?
you mean in incremental? then yes but I don't remember the results
Yeah
Ok
Incremental has slowed down... But more stuff happens now.... You are tablex free now? But incremental is slower than it used to be.
😬
25 Ms for a single citation is ridonkulous
I optimized the heck out of my thesis (not pushed yet): smaller figures, replaced fixed refs with links (heaps faster), native tables, etc.
I agree, if a single citation took 80% of a year to complete I would be worried too
😎
I blame autocorrect
what is pre and post here?
pre and post the PR you did to refactor something to do with lines
I still cannot reproduce that
I mean it used to take 100ms on my machine
what why did it speed up so much. the goal of the PR wasn't even performance.
maybe it's Windows specific
main pre is a tad older, the VM is more comparable here
sowwy
I mean, I did optimize some stuff, but I didn't expect any real-world gains
Did you mess up with environment variables and the package system maybe?
Happy little accidents?
let me see whether I can reproduce this result
Does your thesis use any package from the universe? It might be something like that... 🤡
incremental I wouldn't expect no
I mean 0.1s is good, no?
You should really only compare with the VM here because the main pre was a slightly older build of main
(I can't be arsed to do a new one)
That's noticeable, so that's very good. If it happened unintentionally... That's suspicious, but good?
it is good, this is what I'm surprised about
let me see, just compiling --release and it always takes so long
you did mention in the PR using some Cow maybe less cloning?
yeah, it's that codegen-units=1 at play here
And perhaps accidentally better caching?
maybe that too
@left night I think I actually found a """simple""" way of optimizing eval 
Without doing anything fancy
But y'all'll have to wait to know what it is 😈
for me, it's not really faster. It's pretty much the same.
I compared directly the commits immediately before and after the PR
I mean, for me it was repeatably faster by 0.1s and the run-to-run variance is basically zero
another AMD win 💪
or maybe it's actually Apple Silicon that dealt well with the crappy old code :p
flat ast by any chance?
I mean, it kind of is? 
No, it's Cow, with me it's always Cow
Ok, then it's suggestion for optimization. https://www.cs.cornell.edu/~asampson/blog/flattening.html
@left night isn't our AST flattened?
nope
oh, then it might indeed be worth it 
one might try it, but I do not believe that it will be faster
You tease
and it might make the code less maintainable
the rust kind
that's what I'd be afraid of too
also it would mean a bit more overhead on incremental reparsing
but probably negligible
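For readers who have not seen the technique from the linked post: a flattened AST stores all nodes in one vector and replaces child pointers with indices. The toy expression language below is only an illustration and has nothing to do with Typst's syntax tree.

```rust
/// Index into the expression pool, used instead of `Box<Expr>`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ExprId(u32);

pub enum Expr {
    Int(i64),
    Add(ExprId, ExprId),
    Mul(ExprId, ExprId),
}

/// All nodes live contiguously in one vector.
#[derive(Default)]
pub struct Pool {
    exprs: Vec<Expr>,
}

impl Pool {
    /// Append a node and return its index.
    pub fn push(&mut self, expr: Expr) -> ExprId {
        let id = ExprId(self.exprs.len() as u32);
        self.exprs.push(expr);
        id
    }

    /// Children are always pushed before their parents, so a single forward
    /// pass over the vector evaluates every node without recursion.
    pub fn eval(&self, root: ExprId) -> i64 {
        let mut values = Vec::with_capacity(self.exprs.len());
        for expr in &self.exprs {
            let v = match *expr {
                Expr::Int(n) => n,
                Expr::Add(a, b) => values[a.0 as usize] + values[b.0 as usize],
                Expr::Mul(a, b) => values[a.0 as usize] * values[b.0 as usize],
            };
            values.push(v);
        }
        values[root.0 as usize]
    }
}
```

The maintainability concern is visible even at this size: every child access goes through the pool, and an incremental reparse has to keep the indices consistent.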
Does parsing actually have a performance impact?
not much I think
figures, are we parsing in parallel?
yeah figures
but if imports were parallel then we would
Indeed
if an import or include returned a Deferred<T> i GUESS
cause it returns a parsed Source
Sorry caps
But indeed I don't think World is easy to Send + Sync?
or we'd need a kind of ScopedDeferred<T>
using scoped threads
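A sketch of the scoped-thread variant with the standard library. `Parsed`, `parse` and `parse_imports` are placeholders invented for this example; the real `Source` and `World` types are not used, and the sketch only works because the borrowed input is `Sync`.

```rust
use std::thread;

/// Placeholder for a parsed source file.
#[derive(Debug, PartialEq)]
pub struct Parsed {
    pub path: String,
    pub token_count: usize,
}

/// Placeholder parser: counts whitespace-separated tokens.
fn parse(path: &str, text: &str) -> Parsed {
    Parsed {
        path: path.to_string(),
        token_count: text.split_whitespace().count(),
    }
}

/// Parse all imported files in parallel. Scoped threads may borrow `files`
/// directly, so nothing needs to be `'static` or wrapped in an `Arc`.
pub fn parse_imports(files: &[(String, String)]) -> Vec<Parsed> {
    thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|(path, text)| s.spawn(move || parse(path, text)))
            .collect();
        // Joining plays the role of waiting on a deferred result.
        handles
            .into_iter()
            .map(|h| h.join().expect("parser thread panicked"))
            .collect()
    })
}
```

The catch is that all imports must be known before the scope starts, whereas in practice an import is only discovered while evaluating the file that contains it.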
it already is?
hmmmmmm 
without send + sync world, no multithreading go brr
BTW, I just found a stupidly simple optimization, I'll open a PR for it
probably changes nothing in the real world
but it's so dumb I can't not open up a PR
Thesis is the real world (for now)
Pretty sure it's my smallest PR by an order of magnitude 😂
ah, that's a remnant of an older approach where the family name had to be kept around
figured, it's a complete micro-optimization but I think it's okay, at the end of the day, it has no cost or complexity and it's cheaper
on very constrained devices it might even make the tiniest of differences 🤷♂️
yeah, it has no cost.
@sturdy sequoia how much of the 400 ms is export?
Like 80ms something like that
Found a replacement for @sturdy sequoia 's thesis https://github.com/typst/typst/issues/4501
well
that's a lot of linebreaks
idk but this seems like an instance of https://www.youtube.com/watch?v=ibjLxdp6qg0
however i cannot reply with that video unfortunately
😂
if it has ultralong paragraphs, this is actually a case where my latest PR could help because they might be running into quadratic runtime
but still a very valid question what the heck they are doing
I wonder how fast it would compile on a mid-range cpu^^
I suspect they don't have enough memory to compile the document
Dherse has insane specs IIRC
but then it would be killed right?
given they're on a mac, im assuming it's using a ton of swap
Wouldn't it go to the swap file?
iirc macos just doesnt care and will fill your whole disk with swap if needed
lol
so they might be the first person to have typst consume hundreds of gigabytes of memory. or something
i already thought my 10000 tables and grids consuming 60 GB of RAM was bad, but then someone in #quick-questions was doing this like for real
and now we have this
if anything this shows that the compiler is very resilient 😂
Yeah this document is likely on the order of 1 TB of memory usage or more
i wouldnt doubt it
especially since they said there are footnotes and whatever
i wonder if they're using something like AI or whatever to generate such a document
cuz theres no way i can think of someone actually typing all that
Based on their description and their profile I suspect it may be a genome or something?
hmmmmmm
their profile does suggest they'd do this kind of stuff yeah
😂
paging @feral imp for an assessment
I guess at some point reducing memory usage should be looked into. But this particular one sounds pretty far outside of what would be considered reasonable
Do it
His main research interests are efficient algorithm development, developing automated pipelines for biological data analysis, epigenetics, phylogenetics and, more generally, finding creative solutions for a wide range of bioinformatics challenges.
whatever they're cooking right now is no joke
Is the guy trying to print a human genome in hard case binding?
But people are really into this.. mega docs thing... Just the other day, someone had 250 pages note document........
(which NGL would be pretty cool)
those are rookie numbers
This is like 150 000 pages
Yes! I'm following.
someone had 8k pages on #quick-questions the other day
lol
and this one idk
this one is just astronomical
i have no idea how many pages that would be
My second question is how the fuck did he have the patience to wait 24 hours before thinking perhaps he should ask if that's a normal amount of time
If you told me.. oh typst would be used to generate mega pages documents a year ago, I would have insulted you REPEATEDLY.
Bioinformaticians have the patience of a Newton's cradle.
yeah i was gonna say, this is probably right up their alley
i assume biological simulations arent usually the cheapest stuff
(TBF I'd say a Newton's cradle level of patience is more like ADHD: short loud bursts surrounding long periods of stagnation)
It is hard for me to be maximally entertaining all the time. Sometimes, it is just noise 😛 🤣
@glad urchin they added some more information. I have to admit I don't really understand what they're talking about.
You mean memory?
They haven't compiled the entire document yet though
Anyway do I understand correctly that they have 2.5 million calls to underline()?
i mean
i guess using #let would make parsing faster
but thats about it
memoization would work the same way regardless i think, since the parameters are the same
I don't think that's actually the case. Built in functions are not memoized by default, only user defined ones. So the fact that they didn't define a closure might actually conserve memory.
@left night does Args need to contain an EcoVec of items?
couldn't it just as well be a Vec?
is it cloned often? and does it matter when it is cloned?
because I would expect that it doesn't get cloned often
Ah I see, Vec is bigger than EcoVec
I am assuming that's the motivation here
Interestingly, increasing the size of Args does not increase the size of Value
That being said, on my machine it makes no difference
:/
yeah since it's in Value it must be <= 24 bytes. but with arg sinks it's also cloned from time to time.
yeah but some heavy(er) cloning for rare cases is fine imo
weird
since Args are more mutably accessed than anything else afaik
some strange optimization
or just some rust-analyzer quirkiness
RIIZ
IS THAT WHY THE KID TALK ABOUT RIIZ?
/s
zig?
yes that's the joke
hadn't seen RIIZ before
I just invented it 😎
^^
Description Brief Provide a partial evaluation API, with semantics: generate PDF from page $i$ to page $i+m$. This feature makes sense only if intermediate results can be offloaded to disk storage....
I'm confused
I think this person assumes that we could just "render page n" and it would be faster than exporting a whole document
but it isn't
since exporting is actually among the fastest steps
yeah, though it's not even clear what they wanted, since they added offloading to storage into the mix
Probably hopeful we could do something that fits their exact workload
4*365 paragraphs isn't that obscenely huge, probably on the order of 500 pages?
I feel like there's a different reason why it doesn't compile
Indeed, doesn't seem that huge imo
Looking at their script, they are generating ~10000 tags (there could be duplicates, that should be pretty rare), each of which associated with a paragraph (considering 4 paragraphs a day this would take 6-7 years to write) of 100 words or tags (with tags being ~1/5 of the total). That's roughly 800000 words and 200000 refs in total.
If you bring down the number of tags to 1000 then it takes ~15 seconds (I'm on the latest release, it might be faster on main) and produces 540 pages, which shouldn't be that bad.
When you say tags you mean labels?
The script calls them tags and generates a #tag[a_word] for each of them. I haven't dug into what that function does but I guess it creates a label with that word, because the same words are used as refs with @a_word
@sturdy sequoia seems it's introspection then. That's a lot of labels, but I still would expect it to compile
What happens if you try to compile the original with 10 000 labels?
I did try it and gave up after leaving it on for a minute. I guess it would take at least 10 times the amount of time it takes with 1000 labels, so at least 2 minutes and a half
It would surprise me if it scaled linearly
Try 2000?
Ah I should also mention that those 15 and 5 seconds are with --timings (which I guess might slow down compilation a bit)
Well, with 2000 it took 37 seconds to spit out a bunch of errors because the script didn't ensure that labels are unique...
Okay, that might be part of OP's problem
Maybe you could reply to their issue?
Done
Btw if I modify the script to ensure unique labels it doesn't change the timings much (it takes 40 seconds to compile with 2000 tags)
Okay but even then, how would we improve this?
I guess we could, for large numbers of labels, do parallel shenanigans, but that seems overkill
depending on the CPU timing can have basically no effect, on my Desktop (7950x3d) it slows down my thesis by maybe 0.1s
I've also tried the version of typst in main and it seems much better. It takes ~5 seconds for 1000 tags (vs 15 seconds for the latest release) and ~2 seconds for the 1000 tags without refs (vs 5 seconds for the latest release). The full 10000 tags version takes 1m30s, which is kinda slow but not that bad for that amount of text. Also seems much better than the previously predicted 2m30s for the latest release (which also assumes linear scaling).
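A rough way to put a number on "does it scale linearly": fit a power law time ∝ n^k to two measurements. This is only a sketch using the timings quoted in this thread; the function names `scaling_exponent` and `linear_prediction` are made up for illustration, and two data points (some taken with --timings enabled) give a coarse estimate at best.

```rust
/// Estimates k in `time ∝ n^k` from two (size, seconds) measurements.
/// k = 1 means linear scaling, k > 1 means superlinear.
fn scaling_exponent(n1: f64, t1: f64, n2: f64, t2: f64) -> f64 {
    (t2 / t1).ln() / (n2 / n1).ln()
}

/// Linear extrapolation: what `t1` seconds at size `n1` predicts for size `n2`.
fn linear_prediction(n1: f64, t1: f64, n2: f64) -> f64 {
    t1 * n2 / n1
}
```

With the numbers above, `scaling_exponent(1000.0, 15.0, 2000.0, 40.0)` comes out around 1.4 for the latest release, and `scaling_exponent(1000.0, 5.0, 10000.0, 90.0)` around 1.26 for main. `linear_prediction(1000.0, 5.0, 10000.0)` gives 50 s for main versus the measured 1m30s, so the growth looks mildly superlinear rather than linear.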
That doesn't seem unreasonable
I guess there's nothing to worry about for the time being
Thought I'd report back on this one, I'm the dude who had the 8k doc 😄. After making the tweaks you recommended @glad urchin of removing the grids I went from 32GB memory consumption down to 24GB memory consumption all the while the doc has grown to 9k pages! So we're looking good, thanks for the help!
Sounds great 👍
Have you tried to compile on main?
Not yet
24 gigs of ram to compile a pdf :ferrisBallSweat:
8k pages is crazy on its own.
Is there no way to limit the memory usage?
Less threads? Can you even set that manually?
threading is only used on main, there you can use -j
none, the enabled argument for comemo was just released, we'll see if that helps 😉
(it needs to be used, right, but I'll try to open a PR this week that uses it)
Is that just a binary switch for the entire document?
no, it's a condition that's evaluated for every memoizable function call, whether to actually memoize
No, it would be per-function, for example disabling memoization on small functions, or small layout calls, these kinds of things
Manual, or based on a heuristic?
Heuristic hopefully
I still think we could do something time-based
not sure whether it would work on wasm
but maybe later, some manual heuristics go a longer way initially I think
but that would still require hashing the arguments, no?
since it would need to check whether the call is already in the cache?
my original idea was that we could opt out of hashing when we detect that the function call is almost always cheap and go back to caching if we observe that it got expensive
but I now think it wouldn't really work out well
cause it would be static per function and e.g. call_closure just is unpredictable in that way. sometimes the function is expensive, sometimes it's cheap.
so whenever the opt-out would work statically, we might as well remove the memoize in the source code
Perhaps we can do it as an AtomicBool in Closure?
that says whether that closure is memoizeable, and this boolean is set based on timings?
for user-code at least it should help!
not a bad idea! although there might of course still be closures that are sometimes very cheap and sometimes very expensive.
but it would be a good approximation I think
maybe something with a few more bits than a bool
so that we can say like "if this was cheap a few times in a row, skip it"
but maybe a bool is actually better because it kicks in faster
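The "few more bits than a bool" idea could look roughly like the sketch below: a small saturating counter of consecutive cheap calls. `CheapStreak` and the threshold of 3 are hypothetical; nothing like this exists in comemo or Typst as far as this thread says.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Counts consecutive cheap calls. Hypothetical helper, not part of comemo.
struct CheapStreak {
    streak: AtomicU8,
}

impl CheapStreak {
    /// Skip memoization after this many cheap calls in a row (made-up value).
    const SKIP_AFTER: u8 = 3;

    const fn new() -> Self {
        Self { streak: AtomicU8::new(0) }
    }

    fn record(&self, was_cheap: bool) {
        if was_cheap {
            // Saturating increment so a long streak never wraps around to 0.
            let _ = self.streak.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |s| {
                Some(s.saturating_add(1))
            });
        } else {
            // One expensive call re-enables memoization immediately.
            self.streak.store(0, Ordering::Relaxed);
        }
    }

    fn should_skip(&self) -> bool {
        self.streak.load(Ordering::Relaxed) >= Self::SKIP_AFTER
    }
}
```

The trade-off is the one raised above: the counter is more robust against a single cheap outlier, while a plain bool reacts after one call.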
I was thinking of having the concept of a Value::len that gives an approximate length of the items which gives us a hint towards the hashing cost & running cost of whatever function is being called on it
Yes but the hashing is still there (if it's a closure)!
That's the trick, it doesn't remove the expensive bit
unless we make a LazyEcoVec that contains the hash on the heap too 😂
the bool could skip the hashing
Indeed, but my idea with len is that we can check if it's going to be costly to hash
overall, if we can somewhat reasonably observe the runtime behaviour, it will be always better than a hand-written heuristic
at least that's my theory
what would you do if Value::len is large?
I agree
that's the problem, there is an argument either way 😂
Perhaps having two bools: one that tells us whether it's slow when len is big, and one for whether it's fast
sounds a little over-engineered tbh
yeah probably
what's the problem with a single bool that says whether it's cheap?
Yeah no, it's good
we would need to measure to find out whether there are too many cache misses due to it
It's fairly easy using a MISS_COUNT and HIT_COUNT, I've actually used this in the past for measuring cache efficiency
wait, are we talking about the same kind of cache?
I meant comemo's cache, not the CPUs
I meant in comemo
ok
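For reference, the MISS_COUNT / HIT_COUNT measurement mentioned above can be as small as two global atomics. This is a generic sketch; `record_lookup` and `hit_rate` are illustrative names, not comemo functions, and where exactly they would be called inside comemo is left open.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static HIT_COUNT: AtomicU64 = AtomicU64::new(0);
static MISS_COUNT: AtomicU64 = AtomicU64::new(0);

/// Call this at every cache lookup with whether it was a hit.
fn record_lookup(hit: bool) {
    let counter = if hit { &HIT_COUNT } else { &MISS_COUNT };
    counter.fetch_add(1, Ordering::Relaxed);
}

/// Fraction of lookups that were hits, or `None` before the first lookup.
fn hit_rate() -> Option<f64> {
    let hits = HIT_COUNT.load(Ordering::Relaxed);
    let total = hits + MISS_COUNT.load(Ordering::Relaxed);
    (total > 0).then(|| hits as f64 / total as f64)
}
```

Relaxed ordering is enough here because the counters are only statistics and do not guard any other data.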
so we would check enabled = !func.is_cheap()
and then in the body unconditionally update is_cheap after the function, correct?
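A minimal sketch of that scheme, assuming a closure-like wrapper with an AtomicBool and a time threshold. `AdaptiveMemo`, its fields, and the `u64 -> u64` signature are all invented for illustration; real comemo memoization works on hashed arguments and tracked constraints, which this ignores.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Memoizes a function only while its calls look expensive (hypothetical type).
struct AdaptiveMemo<F> {
    func: F,
    /// Calls faster than this count as cheap (assumed tuning knob).
    threshold: Duration,
    is_cheap: AtomicBool,
    cache: Mutex<HashMap<u64, u64>>,
    /// How often the wrapped function actually ran.
    runs: AtomicUsize,
}

impl<F: Fn(u64) -> u64> AdaptiveMemo<F> {
    fn new(func: F, threshold: Duration) -> Self {
        Self {
            func,
            threshold,
            is_cheap: AtomicBool::new(false),
            cache: Mutex::new(HashMap::new()),
            runs: AtomicUsize::new(0),
        }
    }

    fn call(&self, arg: u64) -> u64 {
        // enabled = !func.is_cheap(): when disabled, no lookup and no insert.
        let enabled = !self.is_cheap.load(Ordering::Relaxed);
        if enabled {
            let cached = self.cache.lock().unwrap().get(&arg).copied();
            if let Some(hit) = cached {
                return hit;
            }
        }
        let start = Instant::now();
        let out = (self.func)(arg);
        self.runs.fetch_add(1, Ordering::Relaxed);
        // Unconditionally update the flag after running the body.
        self.is_cheap.store(start.elapsed() < self.threshold, Ordering::Relaxed);
        if enabled {
            self.cache.lock().unwrap().insert(arg, out);
        }
        out
    }
}
```

Because the flag is refreshed on every real run, a function that becomes expensive flips back to memoized after a single slow call. Whether `Instant` is usable on wasm is exactly the open question raised in the thread.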
idk whether we can measure anything on wasm though
Should I just close https://github.com/typst/typst/issues/4560 ? It doesn't seem like an actual issue, apart from somewhat high memory usage
feels similar to the recent issue
the 24h one
which is still open ...
fwiw, the hypermedia systems book goes from 6.5s (0.11.1) to 1.8s (main) on my machine. pretty nice win.
it also found a panic in main ... but I fixed it
Double win.
Nice!
I didn’t realize that book is written with typst; cool!!
blog post is out
https://dz4k.com/2024/new-hypermedia-systems/
@pearl sedge very nice! and I didn't know about shiroa, which is also very cool!
Ok, I was watching this conference and I just got ideas on how to make the VM heaps faster

ooo
You could prepare a gift, the vm-3, for laurmaedje, who's on vacation.
I am thinking of first doing an AST flattening stage
then a VM-3 based on that
Currently the big problems of the VM are:
- Complexity
- Non-parallel compile (something I'd like to solve)
- Inefficient data structures
With flattening I fix 3 and reduce 1. Additionally, during flattening, I would look for static paths and launch compilation of those
what's ast flattening? a sequence or a thinner tree of IR converted from ast?
Sunk cost fallacy go brrrrrrrr
😂
still a tree but that uses indices instead of pointers
usize -> u32 (size reduction) and more cache friendly
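To make "a tree that uses indices instead of pointers" concrete, here is a toy flattened AST. It is only a sketch: `NodeId`, `Node`, and `FlatAst` are invented for this example and have nothing to do with the actual typst-syntax types.

```rust
/// Index of a node inside `FlatAst::nodes`: 4 bytes instead of an 8-byte pointer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeId(u32);

#[derive(Debug)]
enum Node {
    Int(i64),
    Add(NodeId, NodeId),
    Mul(NodeId, NodeId),
}

/// All nodes live in one contiguous Vec, so children are usually close in memory.
#[derive(Default)]
struct FlatAst {
    nodes: Vec<Node>,
}

impl FlatAst {
    /// Appends a node and returns its index.
    fn push(&mut self, node: Node) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(node);
        id
    }

    /// Evaluates the subtree rooted at `id` by following indices.
    fn eval(&self, id: NodeId) -> i64 {
        match &self.nodes[id.0 as usize] {
            Node::Int(v) => *v,
            Node::Add(a, b) => self.eval(*a) + self.eval(*b),
            Node::Mul(a, b) => self.eval(*a) * self.eval(*b),
        }
    }
}
```

Building `(2 + 3) * 4` means pushing the three leaves, then `Add`, then `Mul`; children always have smaller indices than their parent, and the whole tree is freed with a single Vec deallocation.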
why dont we use indices when building ast? like rowan, the parsing framework of rust analyzer can reuse (green) nodes by a shared cache from last or even this parsing.
The node struct in typst-syntax is actually meant to be the same structure as in Rowan, and we have an incremental parsing algorithm in reparser.rs, but I'm not sure how optimal it is
@left night What about a hotness and depth based system for disabling memoization?
If the function is called within a function that is cold (rarely updated), its sub functions could be skipped from memoization
essentially trying to memoize at the highest level we can as often as possible
If the caller is not hot, we disable memoization
(locally right)
Unless the function itself is very hot (called often)
since we don't initially know which functions are hot, isn't that something that should rather be handled by eviction?
if a sub-result isn't reused quickly, it will get evicted
Hmm, I suppose yeah
and then it will not be computed since higher levels are memoized
I guess that's true that eviction will evict small functions quickly-ish
But if they were evicted previously, we could skip memoization
doesn't help with peak memory usage after first compilation, but after a few seconds it kicks in
true
perhaps yeah, though it could also mean that user changed which part of the document they are writing in
and then something that was previously cold is suddenly hot
That's the tricky bit and the point of keeping track of "hotness"
which can be integrated into comemo
yeah
so it would be something like enabled = hotness > something or not_previously_evicted. the problem is checking whether it was previously evicted (could be at the function level, but it would still require hashing arguments to perform that check intelligently)
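Spelled out at the function level (so without per-argument hashing), that condition might look like the sketch below. `FuncStats`, the `HOT` threshold, and the halving decay are all assumptions made up for this example; the decay is one possible answer to "something that was previously cold is suddenly hot" and the reverse.

```rust
/// Per-function bookkeeping (hypothetical, not a comemo type).
#[derive(Default)]
struct FuncStats {
    hotness: u32,
    previously_evicted: bool,
}

impl FuncStats {
    /// Made-up threshold for "called often enough to be worth caching".
    const HOT: u32 = 8;

    fn on_call(&mut self) {
        self.hotness = self.hotness.saturating_add(1);
    }

    fn on_evict(&mut self) {
        self.previously_evicted = true;
    }

    /// Run periodically so functions the user stopped touching cool down.
    fn decay(&mut self) {
        self.hotness /= 2;
    }

    /// enabled = hotness > something or not_previously_evicted
    fn memoize_enabled(&self) -> bool {
        self.hotness > Self::HOT || !self.previously_evicted
    }
}
```

A new function starts out memoized, loses memoization once one of its results has been evicted, and regains it only while it is called often.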
Other thing we could try: we could evict more quickly if the function is deep in the call stack
Have you tested regressions recently @sturdy sequoia ? Curious if all the big layout stuff has had any impact
There were big layout stuffs? 😮
@lunar kettle when you have time, can you test the PR I have opened, #4871 it should fix all Oklab colors in PDF (not the way I'd like...)
will test right away
bruh it's 1 AM, it can wait for tomorrow ❤️
Yes
dw the final thing before going to bed 😂
I can also just give you the PDF 😄
I was working on my pdf crate 👀 and it's soon ready for the first release I think, hehe
For instance this
@lunar kettle
Well isn't that nice
Still depressed having removed "native" oklab
this should in theory make it possible to encode as cmyk too, right? even if it's a bit lossy
for the future I mean
yeah :(
cmyk should already work
there was a PR to fix it
The PR is well titled 😂
Did you see the pr I linked above?
yes, I am building but it's slow
Gotta get one of those new 192 core epycs
Speaking of which, I just ordered computer parts, including a 7800x3d
Couldn't really bring my computer with me to the us
Though I did bring a PS5, XSX, GPU, ram and ssds 😂
Using a super simple "best of 10" approach, I get the following results:
commit      cold     t1       t2
c4dd6fa0    1.99s    467ms    115ms
a7c4aae3    2.26s    440ms    117ms
So there has been a significant slowdown on masterproef
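A "best of N" measurement like the one above can be sketched in a few lines. `best_of` and `relative_change` are illustrative helpers, not the script that produced those numbers; taking the minimum is a common way to filter out noise from other processes.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `runs` times and returns the fastest wall-clock time.
fn best_of<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .min()
        .expect("runs must be at least 1")
}

/// Relative change from `before` to `after`; 0.10 means 10% slower.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before
}
```

Assuming the first row is the older commit, `relative_change(1.99, 2.26)` is about 0.136, so the cold compile got roughly 14% slower, while t1 (467ms to 440ms) and t2 (115ms to 117ms) stayed within a few percent.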
Oh no 😦
You had a 7950x3d?
Wait you're already in the US? 😱
Yeah
Congrats ❤️
Yes but my RAM is no longer overclocked, it just wasn't stable enough
and kind of pissing me off
4 sticks of RAM OC'ed on this platform just isn't that nice
As in no Expo?
yes
I had managed to make it super stable but at the cost of insane temps on the I/O die by manual tweaking of timings
4 sticks of ram is a shitshow on any platform. Honestly they should just ship motherboards with only 2 slots
probably yeah
I used to have a Ryzen 2950x with eight sticks, that thing was rock stable
||that gif was annoying me||
frankly, it wasn't that bad, in NUMA mode Windows was actually clever enough
(on the 2950x)
yeah there's some weird scheduling things with core parking
just Windows being windows
Laurenz is purposely doing it so you'll have something to work on
haha 😂
I like speaking to you
but it's almost 1:30 and I am tired
😪
Have a good one ❤️
Nighty night
why would you need to hash arguments for function-level eviction checking
because you still need the hash to store it in case it was needed
I don't follow
https://github.com/typst/typst/pull/4876 @sturdy sequoia you gotta test performance again 😎
This pull request contains a full rewrite of Typst's realization subsystem. This work is the result of a long time of planning and incremental improvements toward making these changes possi...
Sounds more like a call to not do too much performance testing.. But a quickie The Thesis performance stats would be informative..
ill give it a try on my computer
Before:
typst c main.typ --root ../ --font-path ../fonts 7,67s user 1,12s system 323% cpu 2,721 total
typst c main.typ --root ../ --font-path ../fonts 7,68s user 1,20s system 320% cpu 2,775 total
After:
typst c main.typ --root ../ --font-path ../fonts 7,43s user 0,96s system 316% cpu 2,647 total
typst c main.typ --root ../ --font-path ../fonts 7,52s user 1,10s system 309% cpu 2,782 total
typst c main.typ --root ../ --font-path ../fonts 7,45s user 0,99s system 316% cpu 2,668 total
holy f, the oldest still open issue
Does that compare it to the previous commit?
I'm surprised it's actually faster. But that'll depend on the document I guess
yeah I just realized i forgot to check what I had before that 😂
will try again
should maybe also test a document with lots of show text rules?
assuming it even still gives the correct results
rude
even if the new behaviour would be less surprising any documents written for the old one could have worked around it in an incompatible way
I just looked at the thesis before/after and it seems to be exactly the same. But I'm not sure it even uses text show rules.
nice
But generally, I would not be surprised if some bug sneaked in. It's a complete rewrite after all.
Though I did rewrite it very incrementally, constantly running the tests
Something like 100 WIP commits that nobody will ever see
I wouldn't worry about that anyways, this is a change for the best
and Typst is in beta for a reason
actually we dropped the whole "beta" wording ^^
but it's 0.x of course
the beta label was mostly for the web app
From "beta" to "better" 🥳
Holographic content realization - content appearance depends on what angle you look at it (only IPS panel screens supported)
Me too :)
certainly imagined as much, don't be mistaken
that actually looks like a nice lil' performance bump
@cunning wadi Easiest 20% gains in masterproef compile time yet: upgrading my RAM so that I can use EXPO 😎
😎
Wait you weren't using EXPO/XMP?
With four sticks, as BIOS updates rolled out it was too buggy/crashy
I should check how fast my new computer is at mosterproef™️
masterproef * :-p
Using 2 sticks now?
yes
2*32 or 2*48?
64GiB
Yes but I think it’s only supported on intel CPU’s
the only CPU to kill themselves
I mean amd had explosive cpu with the x3d on ASUS motherboards 😂
But to be fair it wasn’t their own fault
Unlike Intel
AHAHAHAH REALLY???
Yes early on with the 7000x3d some motherboards would shove wayyyyy too much voltage down the 3D stacked cores causing them to die
But it was motherboards not respecting the specs from amd about max voltage
No it was purely the motherboard fault unlike intel whose guidelines lead to degradation long term due to poor design and a need to keep bumping the clock to stay competitive
Fairly sure that's not the case
No I have 2x48 with a ryzen, works great 👍
Huh I didn’t know
If that's what you tell yourself to sleep at night!
16 is too much ;-;
Those 833 ns really made a difference
Did you compare with 0.11.1?
I didn't
I mainly wanted to see how fast my new computer can compile
I can do so later though
1 sec thesis is really nice.
Main is increasing capabilities, whilst still retaining respectable performance.
that's with the optimized masterproef, right? (latest commit)
I made the figures smaller and optimized a few things
and 400ms hot, which is too long imo
it should be no more than 100ms incremental even for 164 pages
🤷♂️ I don't know man, performance would be nice, but we are getting a ton of features... And those are a bit necessary. Execution performance isn't necessary... Lower memory usage is though.
It is a balance.
Feel free to make typst fast again
The thing is I keep doomscrolling issues on GH but I don't know where to contribute
performance has gotten so tricky to optimize (multithreading really isn't helping :()
I found a few low hanging fruit but I think that's it
... What about what laurmaedje wrote in the PR... Let me find the link...
This one specifically..
I wonder if just re-allocating Vec for StyleChain would be more efficient than linked lists
I mean linked lists are known to be quite bad for CPU caches and allocations end up being fairly cheap
I guess the cloning into the new vec would be the most expensive
How many styles chains are there on average and how large are they
each set and show rule adds one or more elements to the chain
one for each field set in a set but afaik they're grouped in an array
Mhm
StyleChain is a linked list of slices
But I guess in a large document that can get quite long
Do they share common nodes?
yes afaik since they're built from the top of the doc down
But there @left night is far far far more knowledgeable than me
I probably shouldn't be talking without confirmation
oh is it like a cons list?
I wonder if something like a persistent map would be more efficient. It would retain the sharing of the style chain for the nodes up the chain while being faster to index into.
As hinted in the PR, I want to rework all StyleChain<'a> into &'a Styles. Styles is planned to become more than just Vec<Style>. I don't have very concrete plans yet, but I want to make it somewhat efficient to clone, but at the same time very fast lookup for style properties, recipe matching, and lazily initialized RegexSets for text show rule matching.
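For anyone following along, "a linked list of slices" with shared parents can be sketched like this. It is a heavily simplified stand-in: `Style` here is just a key and a number, and `Chain` is not the real StyleChain, only the same shape (a borrowed head slice plus an optional borrowed tail).

```rust
/// Simplified stand-in for a style property: a name and a value.
#[derive(Debug, Clone, PartialEq)]
struct Style {
    key: &'static str,
    value: i32,
}

/// A linked list of slices: each link borrows its own styles and its parent.
#[derive(Clone, Copy)]
struct Chain<'a> {
    head: &'a [Style],
    tail: Option<&'a Chain<'a>>,
}

impl<'a> Chain<'a> {
    fn root(head: &'a [Style]) -> Self {
        Chain { head, tail: None }
    }

    /// Adds an inner link; the existing chain is shared, not copied.
    fn chain(&'a self, head: &'a [Style]) -> Chain<'a> {
        Chain { head, tail: Some(self) }
    }

    /// The innermost match wins. The cost grows with the depth of the chain,
    /// which is the lookup concern raised above.
    fn get(&self, key: &str) -> Option<i32> {
        let mut link = Some(self);
        while let Some(current) = link {
            if let Some(style) = current.head.iter().rev().find(|s| s.key == key) {
                return Some(style.value);
            }
            link = current.tail;
        }
        None
    }
}
```

Creating a child link is O(1) and allocation-free, which is what a Vec-per-node design would give up; a lookup that misses has to walk every link up to the root, which is what a persistent map would improve.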
what do you mean with "multithreading really isn't helping :("?
it makes flamegraphs really hard to decipher
even with just two threads (the minimum) it's very difficult to interpret
I think we could make it not use a worker if threads=1
that would make it easier to optimize single-threaded performance
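The "no worker if threads=1" idea, sketched with only the standard library. `run_all` is a hypothetical helper and not how Typst actually schedules layout work; the point is just that the single-threaded path never leaves the caller's stack, so a flamegraph shows one plain call tree.

```rust
use std::thread;

/// Applies `work` to every item and returns the results in order.
/// With `threads <= 1` everything runs inline on the caller's thread.
fn run_all<T, R, F>(items: &[T], threads: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    if threads <= 1 || items.len() <= 1 {
        // No spawning, no synchronization: easy to profile.
        return items.iter().map(&work).collect();
    }
    // Ceiling division so that at most `threads` chunks are created.
    let chunk_len = (items.len() + threads - 1) / threads;
    let work = &work;
    thread::scope(|scope| {
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(work).collect::<Vec<R>>()))
            .collect();
        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect::<Vec<R>>()
    })
}
```

Both paths return the same results, so the thread count can be forced to 1 for profiling without changing behaviour.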
I think that would be better indeed, would make profiling with say VTune a lot easier
Feel free to give that a shot
Perhaps, depends on the exact operations I reckon but I was thinking of persistent data structures too
Me too
it's nice to have been bumped in performance by tweaking the doc itself 😄
I'll do that after work
In my very unscientific testing on a few documents, performance was not affected much. While working on the PR, I primarily focused on getting things right and not hurting performance too much, so that's a good result I would say. Fundamentally, doing relayouts is of course more expensive than not doing them.
Source: https://github.com/typst/typst/pull/5017
Yeah but he is the performance guru
Care to explain? 😄
@sturdy sequoia I'm sure he meant this. That's why I paste'd it in.
New layout dropped
https://github.com/typst/typst/pull/5046 does this only address regressions since 0.11.1, or even stuff from before then?
But it will come with increased ram usage
balanced as all things should be
I've said it before, typst pro should come with a stick of ram
Meanwhile me and my 24+ GB being used by typst while compiling a presentation 😭
Polylux? Png?
Polylux initially, then switched to touying. To be fair though there are a bunch of fletcher diagrams and I reached 24GB while editing them (just compiling when using touying is much more manageable with 1-2GB now). The presentation is like 70 pdf pages, it would be 17 slides but with a bunch of pauses in slides with diagrams.
Ok.
a bit of both
many blocks were cached before (at least I'm fairly sure they were), but some weren't, in particular equations
Yeah I’ve had similar issues with large cetz diagrams or when updating godly
Coldly *
Btw I couldn’t measure any performance improvement on my machine 🥲
not every optimization is tuned for your thesis 😉
I have a math-heavy document I got from someone and incremental is 6x faster with this PR
Noooooooooo
Wow that’s huge
equations were just not cached at all before. oops.
on main, actually no single/multi blocks were cached, only user blocks
on 0.11.1, only LayoutMultiple was cached, not LayoutSingle (at least I think so, I didn't 100% validate that)
Good news is that it’s now cached!
I have a math heavy doc of mine but it’s on the webapp so I wonder if it will benefit from it
I'll definitely be testing that on some of my own documents too
I do have a few with some math which eventually got slow in incremental , purely anecdotal though
And it wasn't that much to bother me, was just a bit more than usual :p
Thought it was normal for the size of the doc, but maybe it improves anyway 👀
so
while answering a forum post, i found one small document that got 7x slower on 0.12.0-rc1 for some reason, which is interesting
(from https://forum.typst.app/t/how-to-create-a-table-with-round-corners-like-in-rect-function/1051/)
#table(
fill: (x, y) =>
if x == 0 or x == 6 or x == 7 or y==1 { silver },
columns: (0.45cm,0.45cm,0.45cm,0.45cm,0.45cm,0.45cm,0.45cm,0.45cm,),
inset: 2pt,
stroke: 0.2pt,
align: center,
table.header(
table.cell(colspan:8, fill: silver)[*Dezember*],
[*KW*],[*Mo*], [*Di*], [*Mi*], [*Do*], [*Fr*], [*Sa*], [*So*]
),
[48], [], [], [], [], [1], [2], [3],
[49], [4], [5], [6], [7], [8], [9], [10],
[50], [11], [12], [13], [14], [15], [16], [17],
[51], [18], [19], [20], [21], [22], [23], [24],
[52], [25], [26], [27], [28], [29], [30], [31],
)
0.11.1 -> 5 ms, 0.12.0-rc1 -> 35 ms
very odd
timings: (0.11.1)
0.12.0-rc1
seems like pad is more expensive now
okay i can confirm this happens even for this MWE
#table(
columns: 2,
[a], [b], [c], [d]
)
3 ms -> 10 ms (~3.3x slower)
big sad
okay false alarm, i blundered, sorry folks lol
testing environment was exceedingly unscientific
so we can ignore that
after deleting everything and starting from scratch it works fine so yeah
my fault 😄
wondering whether we could have a repo that tracks performance, and you can run results on action bots by triggering actions of the repo, like https://github.com/ziglang/gotta-go-fast?tab=readme-ov-file
would be cool to have something more automated yeah
i would have probably used such a thing before posting anything here
haha
Basically the person in the forum thread reported a 10x slowdown by wrapping tables in blocks, so i tried to reproduce and noticed there was a slowdown even without blocks
But turns out i was just using some bad binary , maybe built without optimization or smth as it was in some random testing folder
Using the binary from gh releases fixed it (for the case without blocks)
With blocks there is a slowdown on both Typst versions , but nowhere near 10x for me
Beyond that, it would be great to have something like typst-test bench that provides machine-independent statistics, then we can make such a repo with the tool. We can simply run the tools locally.
More like 1.5x or smth
I had sort of worked on that at some point
But it's tricky
can also be that their CPU is old and stuff like that, or low memory bandwidth and a branch mispredict is very expensive for them?
not sure, but now they say that, in 0.11, tables without blocks took 60s to compile, while tables with blocks took 600s
whereas, in 0.12.0-rc1, tables + blocks is taking 10s
so that's impressive
For our practical university course, the time has also gone down from ~10s to ~3s (on a rather old pc). So thank you for the improvements!
prob not relevant, but in python there's airspeed velocity (https://asv.readthedocs.io/en/latest/) (example site: https://pv.github.io/numpy-bench/). Maybe something similar exists for rust
@sturdy sequoia some interesting stuff I dug up about the history of pre-interned strings in rustc:
- https://github.com/rust-lang/rust/pull/59655
- https://github.com/rust-lang/rust/pull/59655#issuecomment-480598659
- https://github.com/rust-lang/rust/pull/60630
- https://github.com/rust-lang/rust/pull/95726
in short: rustc has a list with all the stuff in one place. I'd love to have a const/static PicoStr, but such a list is kind of a pain. I wish the compiler could somehow do it for us, but looks like the rustc folks also didn't come up with a better way.
Indeed, having static strings to at least skip initialization of all "built-in" strings would be so nice, I wonder if eventually we'll be able to do it with const functions?
Or I could just spend the two hours to make it by hand lol
replacing the pico macro with (essentially) a static list of strings it matches over
even if const fn evolved a lot, I think it'd be hard/impossible because it needs global state
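One way the "static list it matches over" could look, as a sketch: IDs below the list length are resolved in a const fn with no global state, and everything else falls back to a runtime interner. `STATIC_STRS`, `Id`, and `intern` are invented for this example; this is neither how PicoStr works today nor how rustc's pre-interned symbols are generated.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Strings known at compile time; their ID is simply their position.
const STATIC_STRS: &[&str] = &["body", "div", "h1", "span"];

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Id(u32);

/// Byte-wise string comparison that is allowed in const contexts.
const fn str_eq(a: &str, b: &str) -> bool {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    if a.len() != b.len() {
        return false;
    }
    let mut i = 0;
    while i < a.len() {
        if a[i] != b[i] {
            return false;
        }
        i += 1;
    }
    true
}

impl Id {
    /// Const lookup: a linear scan over the static list, no global state.
    const fn from_static(s: &str) -> Option<Id> {
        let mut i = 0;
        while i < STATIC_STRS.len() {
            if str_eq(STATIC_STRS[i], s) {
                return Some(Id(i as u32));
            }
            i += 1;
        }
        None
    }
}

/// Runtime path: static strings first, then a global map for the rest.
fn intern(s: &str) -> Id {
    if let Some(id) = Id::from_static(s) {
        return id;
    }
    static RUNTIME: OnceLock<Mutex<HashMap<String, Id>>> = OnceLock::new();
    let mut map = RUNTIME.get_or_init(Default::default).lock().unwrap();
    // Runtime IDs start right after the static ones.
    let next = Id((STATIC_STRS.len() + map.len()) as u32);
    let id = *map.entry(s.to_owned()).or_insert(next);
    id
}

/// A tag available in const contexts; fails to compile if the list lacks it.
const H1: Id = match Id::from_static("h1") {
    Some(id) => id,
    None => panic!("not in the static list"),
};
```

The pain point from the discussion is still there: the list has to be maintained in one place by hand (or by a macro), and the linear scan at runtime would want to be replaced by something faster for a long list.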
maybe we should first discuss what kind of strings we'd like to have statically interned in the first place
right now PicoStr is used exclusively for Labels
I think it would be nice to use it for all fields (possibly replacing the u8 field ids?)
I also would like to use it for HTML tags & attributes in the future
and I'd like to have the tags available in const contexts
I know that using it for fields and function args actually makes a decent perf bump even with the current impl
(since I had done it for the VM notably)
Huh it's funny because I hadn't tested it, since the VM used IDs for variables
we probably checked before, but are rust static strings guaranteed to be dedup-ed?
maybe each scope could have its own list of IDs for variable (so as to not pollute the global interner)
const ones are but iirc only within one crate
so we can't rely on pointer comparison for string literals?
I am not sure
also unfortunate that strings have alignment of 1
kinda prevents using padding bits for tricks
This is the best source I can find
but I guess distinguishing between static strings and runtime ones doesn't work anyway
😦
