#Performance
1 messages ยท Page 5 of 1
@sturdy sequoia wake up I found the hasher
so we are like 5 times slower than python now ๐ค
WHICH ONE
๐ฎ
You did better โค๏ธ
is it faster than main now? ๐
It is!
i mean that's something!
Not great not terrible
baby steps ๐
callgrind now reports that the slowest function in the projects is Value::hash?? Somehow??...
I don't think I trust it though
it's not a very smart function ๐ค
I know right
yes, there is still some hashing going on
somewhere
||over the rainbow||
which is weird 'cause it does... nothing
except some std::mem::swap
Well.. yeah, it does swaps
they should be cheap, no?
unless it's copying ungodly amounts of ram
(did someboday say "wrap it in a box"?)
let me check the code, idk ๐ค
On my machine it's now about twice as fast as main ๐
(mind you I have a more optimized version than you, I'll push gimme a minute)
There you go, I just pushed ๐
the Joiner is not that small ๐ค
True, it's surprisingly large
but like that's just a single memcpy, right?
RIGHT?
it should be, I think
I want to disassemble it but I am scared of going through inlined garbage
does every iteration of the loop requires enter_scope?
no, just once to pass the iterator
ok I see
sad? why?
you got this
Thanks bro โค๏ธ
the more meetings, the less demand is going to be on you fam.. Just breath and be yourself.
yeah
if you can write a typst vm, you 100% can uhm...
give photons a little push to turn a little? idk
it allocates about a kilobyte of stack space, and then takes its sweet time copying what I assume is joiner with ~20 SIMD instructions
so seemingly nothing special about it
... and then run is inlined so it makes so sense afterwards ๐ญ
An interesting thing to do may be to memoize hash of value right in that same value, possibly with interior mutability, because it seems to me that same objects get hashed over and over again. In my call graph it goes up to 7 levels of hashing deep, and I have no idea how is it even possible. I don't have anything 7 levels deep in my raytracer, I think ๐ค
about 1/6 of all memory reads is spent hashing level 2 objects actually
this is a weird call graph though ๐ค
yes, I think that comemo::memoize for small functions is kinda bad
if big functions accept deeply nested objects and hash them over and over again, than it's just as bad
yes, I think we should only memoize bigger functions
because it might make most of the cost of execution being hashing when computing the function is cheaper
well... yeah, I suppose this would be a start
it's fairly easy to do now since we can just put a threshold on the number of instructions
you can do some simple complexity analysis
ACTUALLY
It's quite powerful
because you can introduce "move-sematics"
or something like that
because small functions may not require clone on write
especially when they are not memoized
so maybe it's possible to not copy objects => to not bump reference counter => to make writes in them cheap because they no longer require clone
I added inline(never) to run and now it does not use SIMD for some reason
I love compilers
I wanted to try and do something in that direction while merging content and value. Since everything needs to change then, it's a good opportunity to introduce some lazy prehashing.
Right now, with Value being an enum, it might be tricky
Yeah I saw that
I wonder whether it's possible to gain something by just hashing, say, dicts and arrays
because if I understand correctly: 1. They are used very actively; and 2. They do require recursive hashing, and it is pretty slow
ok no sorry I won't do that right now - turns out it's not that simple ๐
There was a release mode? (Jkjk)
Where's the hashing? ๐
They had us bamboozled the whole time!
Ok, so, with main it runs 155e9 instructions while with the VM it runs 76e9 instructions
quite an improvement @atomic violet
Oof
it literally OOM'ed ๐ฆ
I just wanted higher resolution uwu
oh no my WSL just OOM'ed
like, the entire thing
@atomic violet that's the highest resolution I could get ๐
Are there instructions in main? ๐ค๐ค
like real CPU instructions :-p
you ๐ฅ
โค๏ธ
So... It is twice as fast?
Also, typesetting is what we are here for, but this might be the ideal thing to build a vm. I still don't really get it.
Thesis still at 14%?
(new phone)
Oh yea should have looked at the numbers ๐
Hey, I did some improvement, it's 10% faster than it was five minutes ago (relative to that)
lemme test
it's actually only 10% faster since main is a bit faster than before (some changes by @left night)
but I mean it's 10% in my thesis, in something like @atomic violet's raytracer is twice as fast
it also uses a ton less ram
which is nice
But dude didn't you say eval was only 20 pct if your thesis anyways?
I since updated the version of "main" I am comparing to
I was like 40-ish commits late ๐
Alright. Taken ram usage down a notch would be great too............
it's about a 35% reduction on elteammate's doc
and I just gained another ~5% on the raytracer ๐
I mean it brings my thesis down to 5.5s compile time
that's pretty darn good if I do say so myself
down to just 55B instructions (real ones) from 155B on the latest main!
And I think there is still room
what about watch? or is this unaffected by this
Lemme try and get back to you, from what I can tell (I tested earlier) it's slightly improved too
Unfortunately I have a (vscode?) bug that prevents me from getting good measurements in incremental because it somehow saves the files twice

I lost some performance (for better code quality & clarity) but I regained it somewhere lese 
Race between main and vm 
This is probably the same bug as https://github.com/typst/typst/issues/3312
Ah, interesting, well yes it's since I merged with main that I've had the issue ๐
I'm pretty sure that this change is the cause: https://github.com/typst/typst/pull/2665/files#diff-ea95583ebe87e25ee323e9ad5d1a654f2fbf243e9f4d1727c49d7271585149bdL40-R46
How would you modify it such that it works again?
Because if so I don't mind fixing it locally and testing it
(since I can reproduce the issue)
I can confirm that it fixes it
I'll open a PR right now ๐
Thanks for testing! I'll wait for @cunning wadi to say what the original idea behind the change was in case some change was necessary after all.
@left night Have you thought about adding a swap_remove method to EcoVec, my reasoning is that for arguments, we use a linear lookup into an array of Args and then we remove them, but that means that for each argument we eat, consume, etc. we do a full memcpy of the contents of the arguments. While that is not necessarily slow, it's wasted CPU instructions & memory bandwidth as opposed to a swap_remove!
I've actually done a ton of optimizations to other aspects of closures and this seems to be the main remaining bottleneck
Wouldn't that mess with the order of positional arguments?
isn't their order stored in the actual Arg struct?

Or am I stupid?
'cause then you're probably right
๐ญ
but but memcpy ๐
Ok, the VM is on the verge of being 3x faster than main
๐ช
and the code is still readable
nope
unfortunate I suppose
It would require an iterator style approach
or we could leave behind empty slots or something
that could work I suppose, but the overhead might be higher than the gains
I'll try that real quick
@atomic violet if you wanna try it I just pushed loads of optimizations ๐
it compiles ๐
Without the math equations *
did you break it?
no, math functions have special semantics I haven't replicated yet
the information is in the instruction, I just don't use it yet
best optimization is just not rendering some stuff ๐
I have literally three equations in my thesis ๐
Only two of them are disabled
three equations that use functions *
so how much faster is it? 20%?
about 10% total ๐
Which isn't much
but I think it can be further improved by decreasing the cost of instantiating a VM
which is possible mind you
But, it uses 32% less RAM
which is amazing too imo
and well worth it
I think a thesis shouldn't consume 2 gigs in the first place... So honestly, it might be very good.......
I think that with selective memoization based on function complexity, we might be able to reach even lower!
2 GB for an entire thesis isn't that bad
If you've got no way of lowering that? Sure. But if you can get that down a notch, that may justify a vm by itself.
do you know why it does that? I would expect comemo usage to be roughly the same and shouldn't the active evaluation be dwarfed by the comemo cache?
I know right
I have no idea why it does that
Weird, I retook my measurements and got different results from yesterday
It's still lower, but only by 10%
๐ค
Which I guess makes more sense
It gets slightly better when factoring incremental going to ~15% better after a few changes
so there is definitely something here
just less than I had measured
weird
Maybe I misread? ๐
Now I confirm, on @atomic violet's raytracer it's 32% better
so that number didn't come from my thesis but the ray tracer
@atomic violet that's awesome
webgpu support when?
DOOM port when.
to be fair, with the new VM, you know...
And by one I mean 10% on the raytracer
๐
I managed to remove sooooooo much cloning
Nice! โค๏ธ
Typst: the clone wars
That's the name I was going to give to the commit ๐
Ok my thesis is now 12% faster (which doesn't seem like a lot, but it's nice anyway)
I have done tons of improvement to show and set rule handling
reducing cloning, complexity, etc.
@left night what do you think of making the argument names as PicoStr
since those are pretty much always very similar
it would make argument matching much cheaper overall
we can try
It's actually a fairly simple change, I modified the macros to directly build the PicoStr using a static lazy, and then I made a pico macro to do the same everywhere else (to avoid rehashing the keys everytime)
It is so sad that we need to build PicoStr in a lazy static instead of being able to do it at compile time :/
ikr, I was thinking that maybe there would be a way to do that
but it would not be wasm compatible afaik ๐ฆ
You know what's funny: The wasm binary is in some respects optimized in ways the local one cannot be.
Specifically, when control flow depends on comparing addresses of statics, which happens when checking which kind of element some content is
Because the native binary is ASLR-compatible, so the addresses are resolved at startup time
Meanwhile, for WASM the Rust compiler does also only provide the addresses at link time (so post-optimization) BUT after that we run wasm-opt which can work with the constants.
of course!
That's actually kind of awesome ๐
So named arguments as PicoStr, did it help? yes, but not as much as I was hoping ๐ฆ
I expected that it would make hashing Args cheap enough to help, but it seems that generally the values of the args are the expensive bit ๐ฆ
Is it worth it? I think so because it also re-uses existing systems we have in place for other, similar things, but I am dissapointed that it doesn't do more ๐ฆ
It does improve memory usage by about 80MB on my thesis
so it's something ๐
seems like it's worth it
I think a lazy hash in the future non-inline Value repr would go a long way
I fully agree
I actually implemented something similar for closures: they have a lazy Value::Func(_) inside of them so that when they get called multiple times, it doesn't need to reallocate a Repr::Closure
BTW, using PicoStr on @atomic violet saved around 300M instructions out of 49B instructions (real CPU ones that is)
I'm not. When reporting metrics, it is better to represent the numbers as clear as possible.
I just gained another 500M instructions ๐
I'd love time in between commits as well, then I can make a timeseries of the reduced instructions..
It all adds up
500 million here, another 500 million there
and baby you've got a stew goin
Especially since the changes I'm doing are not very intrusive
I'm doing my best to keep the code readable and explain why I'm doing what I'm doing
can you explain that more?
where would it allocate multiple times?
also: regarding performance improvements. If some of the ones you are doing are independent of the VM (e.g. PicoStr), it would be great if those could be done independently in PRs
essentially, closures are treated in two steps, when they're compiled, they become a CompiledClosure, then they get instantiated (the idea here is to pre-do as much work on the closure as possible, and doing the capture "in situe" since we need the actual values to capture). After instantiation you have a Closure (which is ready to be called). That process is memoized, but since a named closure is declared within itself (to allow recursion), that process is stored inside of the Closure using a OnceCell, the idea here being that it enables me to avoid the Prehashed<...> to be done again neededlessly on subsequent instantiation of the closure
i'll try doing that this week then
Depending on the number, it will be rebasing fest / bitch then.
@left night I did end up trying the dynamic size register thing (6, 16, 32, and then infinite using a vec) and on the thesis it's pretty awesome: we are now 15% faster!
So an extra 5% just from that change
!!!
And the size of the binary barely increased (about 200KB)
wait, so is this with generics or something like smallvec?
generics
smallvec gave barely any improvements
I am confused why this would give a benefit
no allocation for small/cheap closures is my guess
I'll profile it now
did you try limiting the amount of registers during compilation, but keeping vec storage?
that's already the case, it agressively re-uses registers as soon as it can with the exception of function arguments
for example the whole of tablex compiles with a max number of registers of ~40
I have this feeling that the gain can also be had without generics
not sure how though
maybe, but I don't know how
so does it pick between the VM sizes at runtime based on the number of registers used in compilation?
what if you use a slice pointing to an array on the stack instead of a smallvec?
So, yes, it looks how many registers it used and uses an array (or a vec if too big) based on that
it behaves like a slice during evaluation regardless whether it's an array or a vec
my idea is that small functions and closures that get called often will be more effective due to not allocating a Vec and using the stack instead
another theory is that it gets cached immediately in these small closures ๐ค
if that's the cause, then this should also help
I was already no longer using a smallvec btw
because it gave no gains but made it use way more stack
Okay but still
using a slice would get rid of generics while keeping it non-allocating
ah I see what you mean!
Oh yes, that's even better, it's the same "advantage" but doesn't use generics
nice
advantage * not complexity
my brain is fried ๐
To play devil's advocate, it would be good to test on more than one document in case there are regressions for other things ๐
(sounds great though!)
Oh you're absolutely right, I also use @atomic violet's raytracer, definitely the most representative document out there ||/s||
it's representative of 70% of my documents, so as a typical typst user, I agree
@sturdy sequoia I can send you some notes I made for a course (around 35 pages), if you want. It is mostly text and math equations with a few cetz drawings. Could be another interesting testing document. Last time I checked, cold compilation was around 1 sec
sure, please feel free too
I will need to fix math then ๐
Did I send you a polylux presentation at one point?
I think I did
I send them here in case anyone else wants to use them
Every topological manifold has a countable basis of precompact coordinate balls.
lol
coordinate balls
lmao
๐
๐
Wiener measure
lol
Hey, I was reading that @untold turret ๐
Do you mean that I could use a thread-local pools of "pre-used" vecs?
Because I cannot use only one vec forever ๐ค
(since I can call functions recursively)
After replace generic const sized register array from vm, you may use a thread local register pool to reduce allocation furthermore. Another thing is to reduce size of the vmstate struct. we have so many referencing slices to module.inner when constructing VMState. We can reduce the size by replacing module.inner.* to a single module.inner or module reference.
that's true, I could harmonize Inner to be the same for functions and closures, it would make it smaller quite nicely imo
functions and modules *
I'll do that right now ๐
๐ฑthen vmstate will be cheap to allocate I think.
Yes, I see you uses a vec! to allocate them. You could give a try, because I got better performance in my c++ executor by reusing large objects in struct.
A vec of vec registers in thread local may simply help. We can "pop_or_default" it to get a vec register and push it for future reusing. We can have three that for 6,16,32-sized vec registers, but not for variable sized vec registers.
does this make sense? i feel like this should be handled in allocator๐ค
good allocators maintain a arena of different size and other complex structs. in some simple cases like
for (...) {
vector<..> v
}
the memory will very likely be reused(maybe also thanks to optimizer)
i didnt read the whole thread though. just feel recent thing sounds weird.๐
if you think of that then we don't have to use many other allocators in rust, bumpalo, arena etc. In general a default allocator is fast, but other allocators are even faster in specific pattern.
So, I was in bed, but I did do the change, it seems to me to be positive overall, I introduced some bugs while doing it which I haven't debugged yet but it definitely makes the code shorter and easier to follow!
@left night any idea what might be causing this error:
The span is clearly valid since it can make an error from it

The trace has a span. The error doesn't.
The argument span seems to be missing
hmm
it's weird because I think it's there
@left night you were right, I was reading the wrong spans and out of chance it was reading a Span::detached() ๐
While doing some cleanup I've also managed to gain about 1B real CPU instructions on @atomic violet's raytracer

that's like 3%, but it's something!
That's good. I mean you've got a huge implementation, and you being able to improve it, also means it is certainly a little cohesive. So not only are you gaining performance ground, you're giving credence to the VM.
Soon there won't be any instructions left
Snap your fingers and just like that
Done!
Just add a raytrace instruction. Easy
There's still 40-ish bilions of them left
don't worry ๐
; typst.asm
section .data
msg: db "Do you really need that document though?.. Think about all the time you will have to go outside and touch grass, if you stopped writing that paper", 0
section .text
global _start
_start:
mov eax, 1
mov edi, 0
mov rsi, msg
mov edx, 100
syscall
mov eax, 60
mov edi, 1
syscall
typst 2025
that's not 40B 
40 bytes? No, but it is pretty close
Yea careful there!
@left night any objection to the use of a Box<[T]>? On the struct I want to use it in, it saves > 128 bytes just from all of the capacity: usize being saved
Instead of what? Vec?
yes
note that it's read only after creation
and it doesn't require re-allocations
it just removes the capacity field afaik
Recent blogs talks about this.... Owned slice thing. So people are adopting that pattern.
yooo
Nice! Memory foot print is also of interest. I think 18% is a substantial speedup...
indeed since it's stored in a Arc in the end, it does account towards memory use, although not by much of course
BTW, fun fact, with the other improvements I have made, 40% of the execution time of @atomic violet's raytracer is now spent just on argument hashing for function memoization ๐
That's good!?
I don't know ๐ค
I don't like it!
it's basically wasted work
Make the hashing faster?
Is it because the hash is 128 bit that it's slow @sturdy sequoia ?
yes, it's comemo hashing
๐ฆ
no idea, but we need 128-bit for collision resistance ๐ฆ
The thing is
probably quite a bit faster indeed
So
Hear me out
Only fall back on 128 bit on an actual collision
Or use two different 64 bit hashes
Instead of 128
that would be worse on most CPUs since we have SIMD for wider instructions
the thing is that you can't detect collisions
Right, in my defense I just woke up ๐
What do you mean wasted work?
VM allows for decreasing memory usage; that's a win
VM gained 18% performance increase on an actual document: that's a win too
VM enchanced raytracing example to now be.. 2-3 times within python performance; double win I guess?
Well it would be at python performance if it wasn't for all of this hashing 
Presumably these functions don't even have to be memoized, since they're never called with the same argument twice
Implementing RenderMonkey in typst is of course an important goal, but for now, we can just bask in the glory of having typst jump leaps and bounds afar from what it was mere months ago. I guess that's a small win.
Yes. Implement function attribute in typst, and disable caching @sturdy sequoia .
I mean, it's not a terrible idea
Not at all. Probably belongs to another PR.. And actually.. couldn't it be used by other packages rather than just raytracing?
Yes
AFK meeting
I increased the ray bounce
(That was a joke)
I mean, we should first hash to 64-bit then to 128-bit hashing using the 64-bit

(obvious /s)
Anyway short of faster 128 bit hash, or the disabling of memoization for specific functions, I'm not sure there is much you can do
yeah that's about it
๐ฆ
How is hashing of multiple arguments handled? Is it one hash for everything?
I mean it always hashes them one by one
but yet you get one big hash out
Does this step have to happen if the inputs are all numeric?
In general, does numeric input even need hashing?
I guess that's overcomplicating things though @sturdy sequoia
@sly pecan it's more that we don't know they're numeric anyway
Because it's essentially a Vec<Value> where Value can be numeric or it can be anything
But typst has types
yes, but from the point of view of comemo they're opaque af
No, but then comemo would be pretty much unusable for others
or we would get into trait hell for every single value we want to pass into comemo
Are you sure? Maybe this new stuff with impl-trait in assoc type can help with a slick API for that?
Full disclosure, I wish this was possible for my own interests / projects ๐
maybe, but I don't think it's worth it either way
The only solutions I can think of are:
- Using a faster hashing function (assuming one exists that meets the requirements)
- Detecting whether memoization is worth it, but this is a double edged sword, to check whether it's worth it, you would still need to hash the input arguments to check whether it's memoized so it's kind of... useless
- Annotate functions that don't need memoization (in typst that is)
- Caching hashes as suggested by @left night in the
Valuerework
does typst has any good incremental benchmarks?
all this talk is more about incremental compilation, than a single compilation, and I wonder if there are ways to measuring recompilation already?
Regarding number 2, it would reduce memory usage
Presumably
At least
I like option 3 as an advanced feature for package authors
Disabling memoization is like adding #[inline], I feel. Probably won't help much, but if you know what you are doing, go for it
the one downside, I feel, is that if Typst memoization mechanism ever changes internally, it may affect API too
Shouldn't it help quite a bit for simple functions such as used in ray tracing? Fewer memory accesses.
It may
the question is whether it can help, say, cetz
it will also help in this context #1176509648707256370 message
maybe when custom types are going to come out there can be a distinciton like "we cache function, but not methods"
I think what would help cetz the most would be moving some of the computation into the rust side
Dumb idea but have you tried turning comemo off to see if it actually makes performance gains in the eval step?
That's causal profiling. Very, very advanced idea. See https://github.com/plasma-umass/coz
So im actually really smart? ๐ฅ
I'm pretty sure that this is the most important thing here
There is a lot of recursive redundant hashing going on currently
I do not have a problem with using Box<[T]> where appropriate.
In this case it saves 128-bytes in the structure containing the compiled data for a module/closure, since it's read-only it's worth it imo
yes indeed, I think this could help
Unfortunately the lazy hash would need to be atomic I think
Another thing I was just thinking about is to make arrays be sharded, each shard being pre-hashed
it's a bit convoluted, but it could definitely work because upon initial hashing, that can be done in parallel
divide and conquer type of approach
I was thinking more of sharding each array into chunks of N elements
so an array of 1000 elements would be like 4 arrays of 250 elements or something
Might be completely dumb mind you
Maybe you can use 64bit hashing and direct comparison?
I mean we have two ideas:
-
Collisions may rarely occur even when using the 64-bit version.
-
A function is rarely called twice with the same arguments.
Then we can do as follows:
-
Do not use hushing under any circumstances.
-
Switch to using a 64-bit hash plus direct comparison. So if we don't have the same function hashes, obviously they will be different. At the same time, any collisions will be caught by direct comparison. This strategy will actually be more reliable than even using a 128-bit hash.
You can't do direct comparison unless you also store the original object used to call the function
What does python do for hashing. It can be quickish?
I don't know rust, if it helps, there was a discussion for Godot to replace the hash functions by more powerful versions, there was in particular this link where there is a comparison between different algorithm: https://github.com/Cyan4973/xxHash
Can you please describe this explanation a bit more detailed? I really don't know why we need to know something about the object
There must be a hashing, that is made, for detecting hashing of a similar object, that has changed ever so slightly.. ๐
And why we can not compare functions directly?
Don't know.. Something, something reflection, dynamic something?
May be ๐คทโโ๏ธ
In rust function pointer is just a number (with some provenance and address-space information but this used only by compiler itself), so that comparisons are cheap
This is literally in what was released today: https://doc.rust-lang.org/stable/std/ptr/fn.addr_eq.html it says that it ignores the meta data..
Compares the addresses of the two pointers for equality, ignoring any metadata in fat pointers.
-
Well, the state of the art we're based on, tend to use 128-bit hashes, I would assume that it does come from experimentation. Although, I agree that for most cases it would likely be enough.
-
That's not really true: consider that typst runs your code multiple times in a raw until your document stabilizes, this means that all of your
query(...),locate(...), etc. get called N times (up to five), memoization does save a lot of time there
Switch to using a 64-bit hash plus direct comparison.
Direct comparison would be bad because the structures are very tree-like in Typst, which means that you would get extreme cache innefficiency, which has improved a lot during the content rework a couple of months ago
That is also a factor indeed
gxhash would likely be a good alternative too since it's in rust but it currently has no fallback for wasm and unsupported platform (it uses AES instructions in the CPU for acceleration) and is, to some extent, good, it's not cryptographically secure but it's on-par with SipHash 1-3 for collision resistance and DOS protection. The problem with xxHahs is mostly that it's a C dependency and we try to avoid those because they can lead to linking hell on WASM
Yes, direct comparison is very expensive, but it can occur only in one billion of times!!!
Or maybe even rare, I don't remember actually๐
comparing functions is cheap, comparing inputs to functions not so much, if we used direct comparisons, we would need to do one unholy thing: we would need to store the arguments of each call in addition to the hashes which would eat up RAM like there ain't no tomorrow
While I agree, the main problem is detecting problematic collisions, let's take a simple case:
- You have a function calls twice with two arrays
- They cause a 64-bit collision
- You have to either compare the arrays in their entirety (therefore needing to store all previous arguments)
- Or you must have some kind of additional information that you can use to dissambiguate which is not necessarily trivial ๐
For arrays, you could maybe just store the length and have a tuple (u64, usize) with the hash and the length, but do you do that for each argument?
It gets really tricky
HMMMMM
that sounds awefully like... a hash ๐
It can be counter along with 64bit hash?
yes, but that does help you dissambiguate, remember, we want to check if two calls to the same function are the same
we already cache per-function
admittedly, since we store per-function maybe we could do with 64-bit hashes
since functions aren't called a bagilion times
๐ค
@left night what do you think?
I didn't know about gxhash, are you sure there's a C dependency? Xxhash has two implementations in Rust. I don't see a compiled or c file in the 3 repos, but I must be missing something I guess.
@sturdy sequoia I honestly feel like anything other than a 128 bit hash might just be overcomplicating things. The raytracer is pretty much worst case scenario
How much time is spent hashing on a more sane document?
sorry to bring that again, but are there any incremental benchmarks? ๐ I am genuently curious
I do them by hand my taking multiple samples and writing them in an excel sheet

My thesis is still 20 percent or so
My bad then ๐ฑ
I confused it with another hash library most likely
Just like another strange idea:
What if we use two hash functions? One for example 64bit and other 32bit? May be with some parallel hash computation.
Whis may definitely exclude any collisions
Well if we could do with 64 bits that would be even easier ๐ซ
Doing two hash round wouldnโt give you better performance because youโd most likely be bottleneck by your cache and memory bus 
Is SipHash 1-3 calculated in parallel? In this case, yes, there's no advantages
I feel like hashes are so fast to compute, you will lose a shit tone of cycles due to syncronization overhead and false sharing ๐
yep
hmmmm... maybe not false sharing, actually, due to COW semantics ๐ค ....
but it's still won't be very fast, semaphors are not free
Out of the eval part or total?
I'm curious by the way, does x86-64-v3 improve that? @sturdy sequoia
It would be significantly weaker than the current 128 bits
total
not much afaik
I did try it and it wasn't much
SIMD in Typst ๐
Swapping allocators is much more impactful
I assume you've tried two different 64 bit hashes instead of one 128 bit?
(yes i know it's unlikely to be faster)
I think optimizing hashing now will yield inaccurrate results in the long run because semantics of Value are about to change
I would rather wait a bit until API changes will stabilize, and only then try to squeeze out cycles out of hashing
it's not that hard to do later, doing it now will probably just complicate further design changes and more productive optimizations
(let alone increased compile times and binary size)
Not so much. It will be 2^32 ร 2^16 = 2^48. Of course it less than 2^64 for the 128bit hash, but it is actually huge number
You've reduced the expected number of hashes before collision from a number with 20 digits to one with 15
That's the difference between functionally never to maaaaaybe
And it really is important that it never happens
no I haven't ๐ค
The thing is that other than caching hashes, no matter the hash function, it will still show up
hashing over and over again will always be there
the difference is whether we can cache those results
and the answer is: yes we can
On a side note: Damn Obama was so charismatic, can we go back please?
Nah, you'll get four more years of Trump instead
We're together in this 
Knowing you'll be in the US too
You know that guy is still alive right? And he has time? Maybe we can ask him to be spokesperson for typst? (Obama!)
Clearly that's what we need, to pay obama to come make a speech at Typstcon
He recently gave a talk in Belgium for the meesly sum of 650kโฌ, I'm sure <@&1200368100202258552> can get that much cache just for him ๐
pay?! are you nuts? I'll just show him the gradients+VM+raytracer, and he'll do it probono.
Will there be a ball pit?
You get 20 minutes extra in the ballpiiiiiiiiiiiiiiiiiiiit!!!!
never forget
I am not convinced that the 128 bit hashes or even the hash function is a real problem. I think it's really the hashing of the same thing over and over again.
Until we do the value rework, if you want you could experiment with doing the lazy hash only on content rather than arbitrary values. We can then remove the Prehashed on flow, par, etc. and see if we get gains or not.
Jumping to conclusions and risking hash collisions in real documents seems unwise to me.
Fair, I might just try the whole lazy hashing, see how that plays out ๐
I'm thinking of an Atomic<u128> that gets reset when the DerefMut impl of Packed is called
something like that
exactly!
you can also play with whether the hash is just for the element data or also the metadata
if it's the former, no invalidation is necessary for metadata changes
I was first trying to set it inside Inner<T> since it should be pretty easy to detect modifications there
@left night is it normal that Content::with_mut doesn't check uniqueness? ๐ฑ
There's definitely something, my initial kind of baboon-level implementation does speed things up, not as much as I was expecting, but it does make a difference
It's explained in the comment above ;)
can't we somehow specialise hashing such that values smaller than 128 bits are just copied with some TypeId-esque tag or something?
I think I suggested that, but @sturdy sequoia said it was impossible
No, I don't think this had been suggested, that's actually quite interesting for comemo, inside of typst it would be hard, but inside comemo it could be interesting!
@left night I removed all Prehashed<Content> (I had forgor ๐) and I am shocked, it really does give a nice bump in performance across the board, I think this really is the way to go for Value has well
It does contain the label, location, lifecycle, etc. but not the span (which tbf is small enough that we shouldn't care too much imo)
very nice!
Btw I did measure the hit rate and itโs about 90%
So 90% of the time the hash was already stored

This is also the case in incremental but sometimes (depending on change) drops down to 82%, so I am wondering whether not pre-hashing the location would be beneficial
to avoid recursively hashing the T which is likely the most expensive anyway
yes, I also think lazy-cache-hashing just T would be worth a try
I'll just add a LazyHash<T> to util then!
sounds good!
how did you realize a 128 bit atomic btw?
with the atomic crate
since it's simple swap and store operations, you can just do atomic memory operations on it ๐
load and store *
I'm tired ๐ด
will that incur two atomic ops or do CPUs have 128 bit atomics nowadays or does it use a secondary atomic for controlling access?
As far as I can tell it's using atomic operations on pointers
And there is an AtomicU128 in core apparently ๐ฑ
looking at the crates source it sort of seems like they use a spinlock on stable rust
That's essentially what I meant when I said numeric types, though that was a more fleshed out idea
What do you think @left night ?
This library will use native atomic instructions if possible, and will otherwise fall back to a lock-based mechanism. You can use the Atomic::<T>::is_lock_free() function to check whether native atomic operations are supported for a given type. Note that a type must have a power-of-2 size and alignment in order to be used by native atomic instructions.
You're right, on nightly it would have 128-bits atomic
I mean it's still faster ๐
I wonder whether a spinlock is more efficient than two AtomicU64
I'm curious as well now
AtomicU128 is actually only available on ARM64 cores ๐ฑ
I do not yet really understand how that should work.
Switch dynamically between a hash and the raw value?
instead of storing the hash you could store the arguments directly if they're small enough
since comemo is now generic it could be done very efficiently
But that will pretty much always be larger than the hash
We're talking only for integers essentially
Note that this wouldn't apply to Typst functions that take integers. Just to Rust functions that are memoized.
Aha
Hopes and dreams dashed
I was think of another approach recently though
A more probabilistic approach: When invoking a Typst function we measure how long it took. This does not let us skip hashing right away. But if we call it a bunch of times and it was always cheap, the next time we can skip memoization and hashing altogether. If it is more expensive in a later run, we can deopt and start memoizing again. (All of this would probably happen on the Typst level rather than the comemo level.)
Could that information be cached between compiles? It would presumably be a cheap way to improve "cold" compiles
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
I have been trying to find an application for a bloom filter in Typst for years but I have never found one where it really seems to make sense. I think it's mostly useful for really huge data sets.
I was doing a chat gpt, when you talked about the probabilistic approach. ๐คทโโ๏ธ
I don't know but I still don't really want to open that can of worms. Where would you even store it?
I don't know. It was just a hypothetical thought
Bloom filters are a really cool data structure.
let's make a target folder ๐
@left night two atomics are slower ๐ฆ
basically goes back to main perf
Not that surprising I guess. The spinlock wasn't contended.
I'm the guy who just throws out dumb ideas to see if anything sticks ๐
indeed since we're single threaded
null state is 0 :-p
it should be fine I guess
Atomic<Option<u128>> isn't an acceptable type
for atomic because it doesn't impl some trait
we are relying on no collisions after all
And below you, I'm the guy who throws a reference based on keywords in conversation. Powerful team ๐
and that'd be just another collision
worst case here is that it will rehash every time
which isn't too terrible
true
since I would expect a hash of 0 to be rare
like 1/2^128 levels of rare ๐
expecting an "ackchually" momment from our resident mathematicians
have you tried hashing just T yet?
like... no prehashing at all?
yes, the two atomics are just T
the two atomics?
the version with two atomics is just T being hashed
the LazyHash approach *
okay, so this was faster than all of Inner being wrapped in LazyHash?
well using two atomics is slow as heck
I'll try doing a single Atomic<u128> and then test both
what I'm talking about should be orthogonal to one vs two atomics
yes I am testing all four cases
but that's bound to take some time since release build take 2m30s ๐
I know, it sucks
btw have you merged the fix by @cunning wadi for watch
because it's a really annoying bug ๐
I will merge it soon. I wanted to test it again on my system before merging.
@left night I can confirm that a single atomic/spinlock is faster than two atomics, but whether it's just T or T + location, etc. doesn't matter much, it seems to me like in watch, T + other fields is slightly faster
okay
@sturdy sequoia I'm just curious. What about using salsa in the typst?
https://rustc-dev-guide.rust-lang.org/salsa.html
Salsa is a library for incremental recomputation.
A guide to developing the Rust compiler (rustc)
This was discussed ages ago. Something, something rust analyzer is not librarified enough.
I actually don't remember.. But it was considered deeply.
I had looked at salsa before building comemo, but ultimately didn't feel like it would work well for Typst. But they are ultimately at the same level of abstraction, so we are already more or less doing what they do.
It seems to me (through a quick read-through) that they support slightly more advanced features, but I don't think they'd gain us anything significant
It's just a different design. It is more advanced in some aspects, but also more rigid in others.
There you go @left night as we discussed the lazy hash thingy
I am a little bit curious whether 128bit CAS can be used to atomically update the hash without lock, which should be available in most hardware
According to the atomic crate, 128-bit CAS is only available on AARCH64, make of that what you will ๐ฆ
Shouldnโt be
I did check in the core doc for rust, and it only has u128 CAS on ARM too
Perhaps it would require target-cpu=native on x86 that's why it's not supported by default?
or it requires nightly 
I am not so sure but I believe 128bit cas is very important (because of the ABA problem) so most recent arch should support it
That can be true
I think the core lib doesnโt implement this
Maybe this might be a better crate to use as we only need 128bit atomic
Indeed, lemme swap it and test it ๐
Sweet
They also indicate that 128bit load and store can be supported via avx
So we donโt need a cas
On the other hand, I am very curious how likely we are going to have race condition?
That is indeed quite a bit faster
zero atm since we don't use multithreading in the "engine" so-to-speak
but we will eventually
Oh nice to hear
we already parallelized comemo fully to make this possible in the future
but we have (so far) decided that in such cases, racing is okay, at worst you're just doing the same compuation twice, and it's likely that the cost of additional sync would be higher than the gains from it
I was a little bit curious do we need really need atomic or not๐ค
Yah I would like to see multithreaded and your work there really interest me!
I mean typst could mostly be done with only Rc and RefCell and stuff like that, but it limits this multithreaded work
Because hash should be constant for every instance
actually @left night has a multithreaded branch
well yes, but here we want to compute it only once for the value
Very nice
that's the trick, it saves tons of computation on nested structures (which happen a lot in typst)
Yah I am just thinking whether we can perform an optimistic concurrent control there to avoid atomic
I mean technically it's true that racing doesn't matter, it's more the possibility of the structure having been modified by another thread the issue
and therefore the hash having been reset
Oh it can be reset?
Then never mind it would make things much harder
If the hash is only computed once and never changed then I think it might be useful to consider that
yes, that was the previous Prehashed which just stored a u128
now the idea is to still hash on-demand, but store the hash anyway
No I mean once it is stored why changed?
I haven't tested, but I wonder how a RwLock<u128> performs too ๐ค
Might be very expensive
In case the value being hashed is changed
like a Content to assign its Location, stuff like that
It seems to provide more than what we need๐ค
Well wait if a content is changing then atomic hash is not gonna help?
We need to hold the things with a lock (maybe rwlock) to protect the potential that hash doesnโt match the content?
the hash gets atomically reset when the value is modified when the DerefMut impl is called
Yes but if multithreaded
in that case yes, the inner value would be held in some lock
the problem is that when Hash::hash is called you don't have a mutable reference
therefore you need to use interior mutability
Another thread can still call it and compute the hash even after reset but before the content change available to it
in addition, the value might have already been modified and it just needs to be hashed on another thread
no, because you'd have a lock around the LazyHash<T>
which means if you can mutate it, you can't also have a reference to the hash inside of it
Ah ok
I didnโt really thought in the way rust allows
Thatโs interesting
Then I donโt really think we need a atomic128
yes indeed, it appears that using a parking_lot::RwLock is about as fast ๐
I guess they implement some very lightweight rwlock
And doesn't introduce an additional dependency which I like
you have to think of it more as being much lighter than actually re-computing the hash ๐
What would be the performance of a plain u128? As we donโt have multhrrad yet
Haha of course
can't do that since we need mutable access from an immutable reference
Well a little be unsafe?
Just for testing
Fine, but just for you 
Would it be helpful if only one thread is computing the hash and other just wait? But probably computing the hash is fast enough that cost of synchronization is higher
But we canโt benchmark without a real multithreaded model๐ซจ
Maybe we can gain a lot as this is definitely a hot path๐ค
Yes
Too bad syncunsafecell is not available in stable
You would also need to unsafe impl Sync and Send I suppose๐
Oh might because we have zero contention
The rwlock is just as cheap as an u64 atomic read/write๐
Even mutex should perform the same I believe
Thanks for testing hh maybe atomic is cheaper than I thought
atomics on a modern x86 CPU should be relatively cheap in the grand scheme of things
I am not expert in arch in which I should๐ช
Yeah should be
Especially without contention
exactly
But why can we gain this๐ค
Shouldnโt the atomic crate use a very lightweight lock
Maybe just some error on the benchmark
my guess is that the exact implementation it uses differs and is somehow slower?
This is likely very microarchitecture dependant
Probably
I know that my other machine with a 12900k performs very differently in typst mostly due to lower memory latency than my main machine (7950x3d)
yes
I have no idea how parking lot can be faster
parking lot is just stupidly well optimized ๐
True๐
I updated the PR ๐
I see they have a dark magic of using the elision but I have no idea about that
Sweet
Anyway, I'm off to bed it's 2AM here ๐ญ
Oh good night
I found it, absolutely crazily optimized
Transactional Synchronization Extensions (TSX), also called Transactional Synchronization Extensions New Instructions (TSX-NI), is an extension to the x86 instruction set architecture (ISA) that adds hardware transactional memory support, speeding up execution of multi-threaded software through lock elision. According to different benchmarks, T...
that should says that parking_lot is probably as fast as an u128
I'm also curious about this ๐
lol Intel had to disable it all the way to skylake ๐๐๐๐
And it was removed in comet lake from every sku sad
tsx was a security nightmare
well I guess it makes a lot of sense on a machine that only runs tasks that you have verified, but I can see how it would be a nightmare on a regular PC
lol
Actually @sturdy sequoia said that it is verified by the single mutable reference rule in rust, so I suppose itโs almost like immutable
Do you know whether it is the issue with transactional memory in general or just intels issue?๐ค
I think Intel is the only one that pursued this. At least wikipedia says that AMD abandoned their corresponding tech
Advanced Synchronization Facility (ASF) is a proposed extension to the x86-64 instruction set architecture that adds hardware transactional memory support. It was introduced by AMD; the latest specification was dated March 2009. As of October 2013, it was still in the proposal stage. No released microprocessors implement the extension.
I think that arm also have such extension
From what my prof told me, transactional memory is a miracle everyone wants, but no one can implement without introducing terrible SPECTRE-like vulnerabilities. He does not know whether it's just an implementation issue or a deeply-rooted problem due to safe transactional memory is just impossible conceptually. But at least that's the reason why both AMD and Intel abandoned it.
ah good to learn, are you an arch student?
just comp sci with an arch course
curious will the one for arm expose such vulnerability hh? https://developer.arm.com/documentation/102873/0100/Overview
very cool I was planning to wait one prof for the arch course but not successful and will graduate soon ๐
I had to take OS and an arch course to qualify for a computer science master's programme, and for those reasons, I noped out and took a degree in mathematical statistics instead.
oh man OS is fun!
Obviously off-topic, so we can talk about that in #off-topic. Sorry, I did it again..
oops sorry yes
I mean, IBM and a few others also support it
not on x86 mind you
on their POWER architecture
does IBM even make anything other than mainframes at this point?
yes, their POWER 9 is available in servers 
Don't talk down to my darling
one day I'll have one of their CPUs to test

tbf transational memory is good if you trust the code you are running
POWER arch wasn't that Mac in the 00s?
yes
Presumably https://www.ibm.com/z ?
yeah basically
Anyway, it wasn't meant in a negative way
I told you not to download the transactional memory CPU extensions
Anyway, the damage is done now... It cannot be stopped
This is my type of humor. Absolutely hilarious.
So @sturdy sequoia, what's your immediate intentions with the beloved VM?
First let's get LazyHash merged for the nice bump in perf
then I think it's pretty much ready
not merge ready, but the perf is in a good place!
Cute little PR.
ikr

Sad
Also I may have made a confusion in #discussions
I think once I said 10000 tables of 5000 cells compiled in 1 second
But that's actually a separate benchmark
I think it was like 400 tables or so
lol
10000 i think takes 30s or so
:p
But yeah after your PR I'm sure we're getting to 1s ๐
Trust
Hรฉhรฉ no
Is hรฉhรฉ French laughter?
it's more of an evil laugh
I have a really dumb optimization idea: every time document recompiles, choose like 80% of the functions and don't memoize them
that's just plain cursed lol
I think having an estimate of the "size" of the input and the "complexity" of the function would be smarter and clearly would just workโข๏ธ
motivation is that
- 20% will still memoize something, and maybe something worth memoizing
- If a call worth memoization, it is expected to get memoized in 5 interations anyway
- Who cares about the first few recompilations anyway, when typst-preview does than 3 times a second?
- It decreases memory usage and potentially causes less hashing
I think having function attribute in typst to opt out of memo is a powerful concept.
yes, better than what @atomic violet is suggesting
although I wouldn't mind some of whathever he's smoking ๐
it ain't my fault ๐ญ
Heuristics heuristics are truly powerful concept.
Basically, I'm implemented by a collection of heuristics on top of each other.
Well if you want to make a heuristic for when memoization is worth it, go ahead ๐
I am thinking of a simple threshold based system, like if it takes over say 100ยตs, it gets memoized
with that threshold being adjustable of course
I think this could greatly reduce memory size and reducing hashing, what do you think @left night
Guess it wouldn't reduce hashing actually
since you'd still need to check whether it's in the cache
So it's not worth it

Can you count instructions?
real ones or vm ones?
I have a power9 server at home as well!!!!
well I mostly used it as a desktop :^)
how do we measure time?
yes, that's the issue we can't do it either way
there is no good way, we must memoize from the get go
I think we can measure the instruction cost like https://llvm.org/docs/CommandGuide/llvm-mca.html
And always memorize recursive functions
Problem is we can't really do that for typst functions (i.e functions written in typst) ๐
Maybe try not caching some certain functions? For example functions with locate, etc. That is, those that are likely to always be called with different inputs?
#1176509648707256370 message
in 5 iterations you would expect 67% of functions to be memoised following that scheme, not 100%
unless you like, systematically do 20% of the total each recomp and never choose the same one twice
Thank you.
I had forgor about that ๐
But I might give it a whir in the VM
The statistics gods are appeased... for now
@sturdy sequoia PR looks nice, just a few comments, in particular about the unsafe, which I think isn't necessary. I think once merged I will remove Prehashed from comemo. It is too opinionated for it, in particular also the PartialEq and Eq implementations.
Nice!
@left night regarding that PR, I was wondering something, it's now trivial to make Array and Dict use a LazyHash (such that hashes are conserved), should we do that? ๐ค
so what cursed performance shenanigans have been done since last release?
For Dict it's trivial but not for Array because of EcoVec and the memory size of Value
cursed majestic LazyHash and few easy wins, followed by a massive VM that improves memory and performance of typst tremendously. The latter is not yet merged, and probably won't make it into next release. But it is in a state of glory atm.
Declaration: I'm not involved in any of the work, think of me as a correspondent
https://github.com/python/cpython/blob/5e29021a5eb10baa9147fd977cab82fa3f652bf0/Python/ceval.c#L1269-L1305 interesting comment I found
I am afraid doing threaded interpreting may cover the code in the hell of unsafe, and actually won't be as effective because typst is simply not as well optimized as python, but it does not mean it should be disregarded entirely
isn't python a special case because of the GIL?
not sure if it's related
I was afraid of someone saying that
"threaded code" is unrelated to multithreading, it is a way of writing switch statements
The more you know
actually, looks like llvm may just do that without any unsafe hackery ๐ค
https://github.com/ziglang/zig/issues/8220
https://internals.rust-lang.org/t/computed-gotos-tco-threaded-interpreters-experiments-and-findings/4668
so maybe the code is already compiled that way
very unlikely though, llvm probably saw the first bound check in the interpreter loop and went "uhm, yeah, I don't know what the hell the dumb human wrote here... yeah screw optimization, gotta go recompile that regex crate instead" ๐
Question is, how would I give it... more information to make it optimize smarter?
looks like "rewrite it in zig" is an answer ๐
seriously though, idk, compilers are a hard guys to negotiate with
||N|| ||O||
||O||
what
performance time ๐
#contributors message maybe this belongs here
Isn't it a big accuracy concern more than speed (but also speed)
The difference in accuracy is likely extremely minimal
I would be curious if the == E even works since you know... floats
I would argue the int vs float one likely makes sense, the others? probably not so much, would need to see how libc does it
it works if you literally pass calc.e and that's the whole purpose of it
I guess that makes sense
I just read part of this: https://kyju.org/blog/piccolo-a-stackless-lua-interpreter/
This design might work quite well for typst
a stackless design could be helpful for catching mistakes leading to infinite loops (or, for that matter, someone trying to DoS typst.app)
not sure how useful being able to continue a piece of computation in Typst code would be, though
Well, the Typst compilation happens in the browser via WASM.
my existing implementation is stackless ๐
it just has an infinite number of registers
I mean to work on it but between work and now this week which sucked ass (one of my doggo went to the great park in the sky ๐ข ) it's been difficult to find the motivation and time to work on Typst ๐ฆ
no no, I ended up taking a job in Belgium, and it's great!
My boss is awesome, they're financing my PhD, and they're paying me really well (for a PhD candidate)
and I get to work from home 4 days a week ๐
Definitely not in the US then.
@sturdy sequoia I feel a little bad asking this, but since the bytecode VM work seems very stalled, should we perhaps close the PR for the time being? We can always reopen it in the future. I'd just like to keep the PR tracker from growing and growing.
I think that's fair no worries ๐
I'll work on it at some point, but motivation has been tough lately
You know what, I am doing it tonight, I need to work for 1-2 hours, but after that, it's VM time โค๏ธ
Won't be done tonight
but I'll try and merge with upstream
(the existing code I meant)
How did it go?
Working on it 
I'll see if I can finish tonight after work
I walked the 20kms of Brussels yesterday, my feet are โจ destroyed โจ
But, I had a significantly easier time than last year, so that's a good sign ๐
Is this a race or something?
No rush. We just love hearing from you.
it's a semi marathon, but after runners there's a walking race too
Very nice!
Thanks โค๏ธ
@left night I am thinking of splitting the VM into its own crate (i.e typst-vm), but this creates the issue of having external components, etc. The way I am thinking of doing it is by having a Vm<T: VmImpl> where VmImpl allows the creation of all the relevant values (a function to create a SpaceElem, etc.). Because adding all of the (fairly long) VM code into typst feels a bit ugly, what do you think?
Actually that really doesn't work ๐ฆ
Did you envision typst depending on typst-vm or the other way around?
I think only typst-vm depending on typst is at all feasible since the VM will need access to many of the central data structures. I have thought a bit about how to shatter the crates in the future, and my conclusion was that we need most of the type definitions in a core crate, but could extract computations into separate crates (rustc works similarly) which are dynamically "linked" to together through a table of function pointers produced by one thin top-level crate that depends on the core and all computation crates. But I don't think it's necessary to do this for the VM now, we can do it for everything at once sometime in the future.
I was thinking on typst depending on the VM, admittedly it could be the other way arround, but I'll keep doing what I am currently doing: shortening the heck out of the VM, I am trying to make it short enough that it's not 10k lines long
@left night I have changed the logic for import from dynamic values (i.e paths which are not known at compile time): I am now allowing it (which I think it's cool since I saw a package the other day rely on it) but for it to work you must specify the imports manually or use the name of the import (i.e if I import codly, I need to use codly)
I think it's a good tradeoff, if the import can be done at compile time: then it is done fully at compile time
Complexity wise it's really quite okay
so basically not allowing dynamic wildcard imports, right? I think that makes sense. They are cursed.
yes exactly
everything else should work as before hopefully
do you disallow the import cetz.draw: *? I think it is quite often used.
i think they might be referring to something like #import "./the" + file: ...
but not 100% sure
no, that would work
because I can get cetz and .draw at compile time, I have a specific case for these kinds of uses!
It really only is what @glad urchin showed: completely dynamic imports
note that this would also not work:
let cetz2 = cetz
import cetz2.draw
but this limitation could be lifted with a const evaluation and inlining
Actually it would be fairly easy to remove this limitation, I may do it
Just not just yet ๐
Ok, I have implemented it, it's dead easy with the architecture
We are soooo back baby!!
oooooh YEAAAAAAAAAAAAAh
It's also quite a bit shorter
and it is quite a bit more powerful with dynamic imports (I know some packages rely on them) and automatic constant inlining which imo is pretty darn cool ๐
(I should note that the constant inlining it almost free)
BTW I should mention that the new compiler produces much better data structures which may improve performance quite a bit by using linear structures instead of tree-like structures, so I am hopeful it will be better ๐ค
Wdym new compiler
I am partially re-writing it while I port it to the latest typst
mostly simplifications
We really need some sort of CI for performance
Would've detected the performance regression in the 0.11 release candidate for instance
You're right. But to be honest, typst's very pertinent performance issue is the memory usage...
There is no issue at all with testing memory usage in CI, regardless of instancing.. and even hardware..
Alas memory monitoring tools suck. I wouldn't let my worst enemy configure ci mem consumption mont scripts.......
Why not both
I think that memory consumption should be a bit higher priority than compile/inc-compile performance, so I was a lil slick with that relocation of the goal post...
I remember there was discuss about it. It is troublesome since git action bot may have different hardware to affect absolute performance. So we need to set up some platform independent environment for tracking performance.
Besides, we need a nice set of documents for benchmark ๐
And incremental-compilation isn't that easy to test either...
but if you start with memory consumption then that isn't such a big issue.
Good metrics.
I used some heap profiler, which instruments heap (de)allocations for profiling. Since typst's comemo heavily uses memory, some small documents will take seconds to collect data. I doubt whether we can perf both CPU and memory at the same time then.
It was fixed, but I still think 0.11 is a bit slower in some areas
we might not need to have both the AST and bytecode present at once
in fact, we could use an arena for the AST and/or avoid creating an explicit AST altogether
that I caused or that somebody else did?
I don't think it was you
Do I need to git blame and spank someone? 
