#Performance

1 messages ยท Page 5 of 1

atomic violet
#

I removed the Prehashed<...> thing from it and it's like 3 times faster now lmao

#

@sturdy sequoia wake up I found the hasher

#

so we are like 5 times slower than python now ๐Ÿค”

atomic violet
#

in Method

sturdy sequoia
#

oh no

#

๐Ÿ˜ญ

#

I am so dumb

#

Well done โค๏ธ

atomic violet
#

You did better โค๏ธ

lunar kettle
#

is it faster than main now? ๐Ÿ‘€

atomic violet
#

It is!

lunar kettle
#

niice

#

how much?

atomic violet
#

like 1.5x maybe

#

something like that

lunar kettle
#

i mean that's something!

sturdy sequoia
#

Not great not terrible

lunar kettle
#

baby steps ๐Ÿ˜„

atomic violet
#

callgrind now reports that the slowest function in the projects is Value::hash?? Somehow??...

#

I don't think I trust it though

#

it's not a very smart function ๐Ÿค”

sturdy sequoia
sturdy sequoia
#

somewhere

#

||over the rainbow||

atomic violet
#

I don't see it

#

all the calees of hash are call_closure

#

(which is expected)

sturdy sequoia
#

yeah but closures are wayyyy too slow

#

I need to improve that

atomic violet
#

enter_scope is also pretty slow

#

like 1/4 of what closure is

sturdy sequoia
#

except some std::mem::swap

atomic violet
#

Well.. yeah, it does swaps

sturdy sequoia
#

they should be cheap, no?

#

unless it's copying ungodly amounts of ram

#

(did someboday say "wrap it in a box"?)

atomic violet
#

let me check the code, idk ๐Ÿค”

sturdy sequoia
#

On my machine it's now about twice as fast as main ๐ŸŽ‰

#

(mind you I have a more optimized version than you, I'll push gimme a minute)

#

There you go, I just pushed ๐Ÿ˜‰

atomic violet
#

the Joiner is not that small ๐Ÿค”

sturdy sequoia
#

True, it's surprisingly large

#

but like that's just a single memcpy, right?

#

RIGHT?

atomic violet
#

it should be, I think

#

I want to disassemble it but I am scared of going through inlined garbage

#

does every iteration of the loop requires enter_scope?

sturdy sequoia
atomic violet
#

ok I see

sturdy sequoia
#

I have my meeting with the Ayar Labs folk in 25 minutes

#

I am scared

#

uwu but sad

atomic violet
#

sad? why?

sturdy sequoia
#

well more like terrified

#

so

#

๐Ÿ˜ฑ

atomic violet
#

you got this

sturdy sequoia
#

Thanks bro โค๏ธ

feral imp
#

the more meetings, the less demand is going to be on you fam.. Just breath and be yourself.

atomic violet
#

yeah

#

if you can write a typst vm, you 100% can uhm...

#

give photons a little push to turn a little? idk

sturdy sequoia
#

My thesis now compiles 12% faster!

#

We're getting somewhere

atomic violet
#

so seemingly nothing special about it

#

... and then run is inlined so it makes so sense afterwards ๐Ÿ˜ญ

#

An interesting thing to do may be to memoize hash of value right in that same value, possibly with interior mutability, because it seems to me that same objects get hashed over and over again. In my call graph it goes up to 7 levels of hashing deep, and I have no idea how is it even possible. I don't have anything 7 levels deep in my raytracer, I think ๐Ÿค”

#

about 1/6 of all memory reads is spent hashing level 2 objects actually

#

this is a weird call graph though ๐Ÿค”

sturdy sequoia
atomic violet
#

if big functions accept deeply nested objects and hash them over and over again, than it's just as bad

sturdy sequoia
#

because it might make most of the cost of execution being hashing when computing the function is cheaper

atomic violet
#

well... yeah, I suppose this would be a start

sturdy sequoia
#

it's fairly easy to do now since we can just put a threshold on the number of instructions

atomic violet
#

you can do some simple complexity analysis

#

ACTUALLY

#

It's quite powerful

#

because you can introduce "move-sematics"

#

or something like that

#

because small functions may not require clone on write

#

especially when they are not memoized

#

so maybe it's possible to not copy objects => to not bump reference counter => to make writes in them cheap because they no longer require clone

atomic violet
#

I love compilers

left night
#

Right now, with Value being an enum, it might be tricky

atomic violet
#

Yeah I saw that

#

I wonder whether it's possible to gain something by just hashing, say, dicts and arrays

#

because if I understand correctly: 1. They are used very actively; and 2. They do require recursive hashing, and it is pretty slow

atomic violet
#

ok no sorry I won't do that right now - turns out it's not that simple ๐Ÿ˜…

tight glade
#

There was a release mode? (Jkjk)

tight glade
sturdy sequoia
#

Ok, so, with main it runs 155e9 instructions while with the VM it runs 76e9 instructions

#

quite an improvement @atomic violet

#

Oof

#

it literally OOM'ed ๐Ÿ˜ฆ

#

I just wanted higher resolution uwu

#

oh no my WSL just OOM'ed

#

like, the entire thing

#

@atomic violet that's the highest resolution I could get ๐Ÿ˜

tight glade
sturdy sequoia
#

you ๐Ÿฅ”

#

โค๏ธ

feral imp
#

So... It is twice as fast?
Also, typesetting is what we are here for, but this might be the ideal thing to build a vm. I still don't really get it.

Thesis still at 14%?

#

(new phone)

tight glade
sturdy sequoia
#

Hey, I did some improvement, it's 10% faster than it was five minutes ago (relative to that)

sturdy sequoia
#

it's actually only 10% faster since main is a bit faster than before (some changes by @left night)

#

but I mean it's 10% in my thesis, in something like @atomic violet's raytracer is twice as fast

#

it also uses a ton less ram

#

which is nice

feral imp
#

But dude didn't you say eval was only 20 pct if your thesis anyways?

sturdy sequoia
#

I was like 40-ish commits late ๐Ÿ’€

feral imp
#

Alright. Taken ram usage down a notch would be great too............

sturdy sequoia
#

and I just gained another ~5% on the raytracer ๐Ÿ˜‰

#

I mean it brings my thesis down to 5.5s compile time

#

that's pretty darn good if I do say so myself

#

down to just 55B instructions (real ones) from 155B on the latest main!

#

And I think there is still room

sturdy sequoia
#

and on my thesis it goes from 120B isr down to 115B

sly pecan
sturdy sequoia
#

Unfortunately I have a (vscode?) bug that prevents me from getting good measurements in incremental because it somehow saves the files twice

sturdy sequoia
#

I lost some performance (for better code quality & clarity) but I regained it somewhere lese sunglassed_crying

low sapphire
cunning wadi
sturdy sequoia
sturdy sequoia
#

Because if so I don't mind fixing it locally and testing it

#

(since I can reproduce the issue)

#

I can confirm that it fixes it

#

I'll open a PR right now ๐Ÿ˜‰

left night
sturdy sequoia
#

@left night Have you thought about adding a swap_remove method to EcoVec, my reasoning is that for arguments, we use a linear lookup into an array of Args and then we remove them, but that means that for each argument we eat, consume, etc. we do a full memcpy of the contents of the arguments. While that is not necessarily slow, it's wasted CPU instructions & memory bandwidth as opposed to a swap_remove!

#

I've actually done a ton of optimizations to other aspects of closures and this seems to be the main remaining bottleneck

left night
sturdy sequoia
#

Or am I stupid?

#

'cause then you're probably right

#

๐Ÿ˜ญ

#

but but memcpy ๐Ÿ˜

#

Ok, the VM is on the verge of being 3x faster than main

#

๐Ÿ’ช

#

and the code is still readable

sturdy sequoia
left night
#

It would require an iterator style approach

#

or we could leave behind empty slots or something

sturdy sequoia
#

I'll try that real quick

sturdy sequoia
#

@atomic violet if you wanna try it I just pushed loads of optimizations ๐Ÿ˜Ž

sly pecan
#

so what is the status of The Thesisโ„ข๏ธ ?

#

@sturdy sequoia

sturdy sequoia
#

Without the math equations *

sly pecan
sturdy sequoia
#

the information is in the instruction, I just don't use it yet

sly pecan
sturdy sequoia
#

Only two of them are disabled

#

three equations that use functions *

sly pecan
#

so how much faster is it? 20%?

sturdy sequoia
#

Which isn't much

#

but I think it can be further improved by decreasing the cost of instantiating a VM

#

which is possible mind you

#

But, it uses 32% less RAM

#

which is amazing too imo

#

and well worth it

feral imp
#

I think a thesis shouldn't consume 2 gigs in the first place... So honestly, it might be very good.......

sturdy sequoia
sly pecan
#

2 GB for an entire thesis isn't that bad

feral imp
left night
sturdy sequoia
#

I have no idea why it does that

#

Weird, I retook my measurements and got different results from yesterday

#

It's still lower, but only by 10%

#

๐Ÿค”

#

Which I guess makes more sense

#

It gets slightly better when factoring incremental going to ~15% better after a few changes

#

so there is definitely something here

#

just less than I had measured

#

weird

#

Maybe I misread? ๐Ÿ’€

#

Now I confirm, on @atomic violet's raytracer it's 32% better

#

so that number didn't come from my thesis but the ray tracer

#

@atomic violet that's awesome

low sapphire
#

webgpu support when?

feral imp
sturdy sequoia
sturdy sequoia
#

And by one I mean 10% on the raytracer

#

๐Ÿ˜Ž

#

I managed to remove sooooooo much cloning

tight glade
#

Nice! โค๏ธ

surreal hemlock
sturdy sequoia
#

Ok my thesis is now 12% faster (which doesn't seem like a lot, but it's nice anyway)

#

I have done tons of improvement to show and set rule handling

#

reducing cloning, complexity, etc.

#

@left night what do you think of making the argument names as PicoStr

#

since those are pretty much always very similar

#

it would make argument matching much cheaper overall

sturdy sequoia
# left night we can try

It's actually a fairly simple change, I modified the macros to directly build the PicoStr using a static lazy, and then I made a pico macro to do the same everywhere else (to avoid rehashing the keys everytime)

left night
sturdy sequoia
#

but it would not be wasm compatible afaik ๐Ÿ˜ฆ

left night
#

You know what's funny: The wasm binary is in some respects optimized in ways the local one cannot be.

#

Specifically, when control flow depends on comparing addresses of statics, which happens when checking which kind of element some content is

sturdy sequoia
#

I didn't know ๐Ÿค”

#

Why tho?

left night
#

Because the native binary is ASLR-compatible, so the addresses are resolved at startup time

#

Meanwhile, for WASM the Rust compiler does also only provide the addresses at link time (so post-optimization) BUT after that we run wasm-opt which can work with the constants.

sturdy sequoia
#

of course!

#

That's actually kind of awesome ๐Ÿ˜‚

#

So named arguments as PicoStr, did it help? yes, but not as much as I was hoping ๐Ÿ˜ฆ

#

I expected that it would make hashing Args cheap enough to help, but it seems that generally the values of the args are the expensive bit ๐Ÿ˜ฆ

#

Is it worth it? I think so because it also re-uses existing systems we have in place for other, similar things, but I am dissapointed that it doesn't do more ๐Ÿ˜ฆ

#

It does improve memory usage by about 80MB on my thesis

#

so it's something ๐Ÿ˜‚

left night
#

seems like it's worth it

left night
sturdy sequoia
#

I actually implemented something similar for closures: they have a lazy Value::Func(_) inside of them so that when they get called multiple times, it doesn't need to reallocate a Repr::Closure

#

BTW, using PicoStr on @atomic violet saved around 300M instructions out of 49B instructions (real CPU ones that is)

feral imp
#

0.61224489795 %. Not bad.

sturdy sequoia
feral imp
sturdy sequoia
#

I just gained another 500M instructions ๐Ÿ˜Ž

feral imp
#

I'd love time in between commits as well, then I can make a timeseries of the reduced instructions..

sly pecan
#

It all adds up

#

500 million here, another 500 million there

#

and baby you've got a stew goin

sturdy sequoia
#

Especially since the changes I'm doing are not very intrusive

#

I'm doing my best to keep the code readable and explain why I'm doing what I'm doing

left night
#

where would it allocate multiple times?

#

also: regarding performance improvements. If some of the ones you are doing are independent of the VM (e.g. PicoStr), it would be great if those could be done independently in PRs

sturdy sequoia
#

essentially, closures are treated in two steps, when they're compiled, they become a CompiledClosure, then they get instantiated (the idea here is to pre-do as much work on the closure as possible, and doing the capture "in situe" since we need the actual values to capture). After instantiation you have a Closure (which is ready to be called). That process is memoized, but since a named closure is declared within itself (to allow recursion), that process is stored inside of the Closure using a OnceCell, the idea here being that it enables me to avoid the Prehashed<...> to be done again neededlessly on subsequent instantiation of the closure

sturdy sequoia
feral imp
sturdy sequoia
#

Another 2B instructions spared ๐Ÿ˜Ž

#

(real world instruction that is *)

sturdy sequoia
#

@left night I did end up trying the dynamic size register thing (6, 16, 32, and then infinite using a vec) and on the thesis it's pretty awesome: we are now 15% faster!

#

So an extra 5% just from that change

#

!!!

#

And the size of the binary barely increased (about 200KB)

left night
sturdy sequoia
#

smallvec gave barely any improvements

left night
#

I am confused why this would give a benefit

sturdy sequoia
#

I'll profile it now

left night
#

did you try limiting the amount of registers during compilation, but keeping vec storage?

sturdy sequoia
#

for example the whole of tablex compiles with a max number of registers of ~40

left night
#

I have this feeling that the gain can also be had without generics

#

not sure how though

sturdy sequoia
left night
#

so does it pick between the VM sizes at runtime based on the number of registers used in compilation?

#

what if you use a slice pointing to an array on the stack instead of a smallvec?

sturdy sequoia
#

So, yes, it looks how many registers it used and uses an array (or a vec if too big) based on that

#

it behaves like a slice during evaluation regardless whether it's an array or a vec

#

my idea is that small functions and closures that get called often will be more effective due to not allocating a Vec and using the stack instead

#

another theory is that it gets cached immediately in these small closures ๐Ÿค”

left night
sturdy sequoia
#

because it gave no gains but made it use way more stack

left night
#

Okay but still

#

using a slice would get rid of generics while keeping it non-allocating

sturdy sequoia
#

Oh yes, that's even better, it's the same "advantage" but doesn't use generics

#

nice

#

advantage * not complexity

#

my brain is fried ๐Ÿ’€

sly pecan
#

(sounds great though!)

sturdy sequoia
atomic violet
#

it's representative of 70% of my documents, so as a typical typst user, I agree

surreal hemlock
#

@sturdy sequoia I can send you some notes I made for a course (around 35 pages), if you want. It is mostly text and math equations with a few cetz drawings. Could be another interesting testing document. Last time I checked, cold compilation was around 1 sec

sturdy sequoia
#

I will need to fix math then ๐Ÿ˜‚

sly pecan
#

I think I did

surreal hemlock
sturdy sequoia
#

lol

#

coordinate balls

#

lmao

#

๐Ÿ˜‚

sturdy sequoia
#

Wiener measure

#

lol

sturdy sequoia
#

Hey, I was reading that @untold turret ๐Ÿ˜‚

#

Do you mean that I could use a thread-local pools of "pre-used" vecs?

#

Because I cannot use only one vec forever ๐Ÿค”

#

(since I can call functions recursively)

untold turret
sturdy sequoia
#

functions and modules *

sturdy sequoia
untold turret
#

๐Ÿฑthen vmstate will be cheap to allocate I think.

untold turret
untold turret
onyx furnace
#

good allocators maintain a arena of different size and other complex structs. in some simple cases like

for (...) {
vector<..> v
}

the memory will very likely be reused(maybe also thanks to optimizer)

#

i didnt read the whole thread though. just feel recent thing sounds weird.๐Ÿ™

untold turret
sturdy sequoia
#

So, I was in bed, but I did do the change, it seems to me to be positive overall, I introduced some bugs while doing it which I haven't debugged yet but it definitely makes the code shorter and easier to follow!

sturdy sequoia
#

@left night any idea what might be causing this error:

#

The span is clearly valid since it can make an error from it

left night
#

The argument span seems to be missing

sturdy sequoia
#

it's weird because I think it's there

sturdy sequoia
#

@left night you were right, I was reading the wrong spans and out of chance it was reading a Span::detached() ๐Ÿ’€

sturdy sequoia
#

While doing some cleanup I've also managed to gain about 1B real CPU instructions on @atomic violet's raytracer

#

that's like 3%, but it's something!

feral imp
#

That's good. I mean you've got a huge implementation, and you being able to improve it, also means it is certainly a little cohesive. So not only are you gaining performance ground, you're giving credence to the VM.

sly pecan
glad urchin
#

Done!

surreal hemlock
#

Just add a raytrace instruction. Easy

sturdy sequoia
#

don't worry ๐Ÿ˜‚

atomic violet
#
; typst.asm
section .data
     msg: db "Do you really need that document though?.. Think about all the time you will have to go outside and touch grass, if you stopped writing that paper", 0
section .text
global _start
_start:
    mov eax, 1
    mov edi, 0
    mov rsi, msg
    mov edx, 100
    syscall
    mov eax, 60
    mov edi, 1
    syscall
#

typst 2025

atomic violet
tight glade
sturdy sequoia
#

@left night any objection to the use of a Box<[T]>? On the struct I want to use it in, it saves > 128 bytes just from all of the capacity: usize being saved

sturdy sequoia
#

note that it's read only after creation

#

and it doesn't require re-allocations

#

it just removes the capacity field afaik

feral imp
#

Recent blogs talks about this.... Owned slice thing. So people are adopting that pattern.

sturdy sequoia
#

The thesis now compiles 18% faster!

#

(than main)

glossy shore
#

yooo

feral imp
sturdy sequoia
sturdy sequoia
feral imp
#

That's good!?

sturdy sequoia
#

I don't like it!

#

it's basically wasted work

sly pecan
#

Is it because the hash is 128 bit that it's slow @sturdy sequoia ?

sturdy sequoia
#

๐Ÿ˜ฆ

sly pecan
#

How much faster is 64 bit?

#

Yes I know we can't use it

sturdy sequoia
sly pecan
#

The thing is

sturdy sequoia
sly pecan
#

So

#

Hear me out

#

Only fall back on 128 bit on an actual collision

#

Or use two different 64 bit hashes

#

Instead of 128

sturdy sequoia
sturdy sequoia
sly pecan
feral imp
#

What do you mean wasted work?
VM allows for decreasing memory usage; that's a win
VM gained 18% performance increase on an actual document: that's a win too
VM enchanced raytracing example to now be.. 2-3 times within python performance; double win I guess?

sturdy sequoia
sly pecan
#

Presumably these functions don't even have to be memoized, since they're never called with the same argument twice

feral imp
#

Implementing RenderMonkey in typst is of course an important goal, but for now, we can just bask in the glory of having typst jump leaps and bounds afar from what it was mere months ago. I guess that's a small win.

feral imp
sly pecan
#

I mean, it's not a terrible idea

feral imp
sly pecan
#

Yes

feral imp
#

AFK meeting

sturdy sequoia
#

I increased the ray bounce

sly pecan
sly pecan
sturdy sequoia
#

(obvious /s)

sly pecan
#

Anyway short of faster 128 bit hash, or the disabling of memoization for specific functions, I'm not sure there is much you can do

sturdy sequoia
#

๐Ÿ˜ฆ

sly pecan
#

How is hashing of multiple arguments handled? Is it one hash for everything?

sturdy sequoia
#

but yet you get one big hash out

sly pecan
#

In general, does numeric input even need hashing?

#

I guess that's overcomplicating things though @sturdy sequoia

sturdy sequoia
#

@sly pecan it's more that we don't know they're numeric anyway

#

Because it's essentially a Vec<Value> where Value can be numeric or it can be anything

sturdy sequoia
sly pecan
#

Do they have to be?

#

(sorry for the questions, I'm aware they're naive)

sturdy sequoia
#

or we would get into trait hell for every single value we want to pass into comemo

feral imp
#

Full disclosure, I wish this was possible for my own interests / projects ๐Ÿ™ƒ

sturdy sequoia
#

The only solutions I can think of are:

  • Using a faster hashing function (assuming one exists that meets the requirements)
  • Detecting whether memoization is worth it, but this is a double edged sword, to check whether it's worth it, you would still need to hash the input arguments to check whether it's memoized so it's kind of... useless
#
  • Annotate functions that don't need memoization (in typst that is)
#
  • Caching hashes as suggested by @left night in the Value rework
atomic violet
#

does typst has any good incremental benchmarks?

#

all this talk is more about incremental compilation, than a single compilation, and I wonder if there are ways to measuring recompilation already?

sly pecan
#

Presumably

#

At least

atomic violet
#

I like option 3 as an advanced feature for package authors

#

Disabling memoization is like adding #[inline], I feel. Probably won't help much, but if you know what you are doing, go for it

#

the one downside, I feel, is that if Typst memoization mechanism ever changes internally, it may affect API too

sly pecan
atomic violet
#

It may

#

the question is whether it can help, say, cetz

#

it will also help in this context #1176509648707256370 message

#

maybe when custom types are going to come out there can be a distinciton like "we cache function, but not methods"

sly pecan
tight glade
#

Dumb idea but have you tried turning comemo off to see if it actually makes performance gains in the eval step?

feral imp
tight glade
#

So im actually really smart? ๐Ÿ”ฅ

left night
#

There is a lot of recursive redundant hashing going on currently

left night
sturdy sequoia
sturdy sequoia
left night
sturdy sequoia
#

it's a bit convoluted, but it could definitely work because upon initial hashing, that can be done in parallel

#

divide and conquer type of approach

left night
#

Tree arrays?

#

That's ambitious

sturdy sequoia
#

so an array of 1000 elements would be like 4 arrays of 250 elements or something

#

Might be completely dumb mind you

ornate merlin
# sturdy sequoia I mean, we should first hash to 64-bit then to 128-bit hashing using the 64-bit

Maybe you can use 64bit hashing and direct comparison?

I mean we have two ideas:

  1. Collisions may rarely occur even when using the 64-bit version.

  2. A function is rarely called twice with the same arguments.

Then we can do as follows:

  1. Do not use hushing under any circumstances.

  2. Switch to using a 64-bit hash plus direct comparison. So if we don't have the same function hashes, obviously they will be different. At the same time, any collisions will be caught by direct comparison. This strategy will actually be more reliable than even using a 128-bit hash.

sly pecan
west light
#

What does python do for hashing. It can be quickish?

shy sage
ornate merlin
feral imp
#

There must be a hashing, that is made, for detecting hashing of a similar object, that has changed ever so slightly.. ๐Ÿ˜…

ornate merlin
feral imp
ornate merlin
feral imp
sturdy sequoia
# ornate merlin Maybe you can use 64bit hashing and direct comparison? I mean we have two ideas...
  1. Well, the state of the art we're based on, tend to use 128-bit hashes, I would assume that it does come from experimentation. Although, I agree that for most cases it would likely be enough.

  2. That's not really true: consider that typst runs your code multiple times in a raw until your document stabilizes, this means that all of your query(...), locate(...), etc. get called N times (up to five), memoization does save a lot of time there

sturdy sequoia
sturdy sequoia
sturdy sequoia
# shy sage I don't know rust, if it helps, there was a discussion for Godot to replace the ...

gxhash would likely be a good alternative too since it's in rust but it currently has no fallback for wasm and unsupported platform (it uses AES instructions in the CPU for acceleration) and is, to some extent, good, it's not cryptographically secure but it's on-par with SipHash 1-3 for collision resistance and DOS protection. The problem with xxHahs is mostly that it's a C dependency and we try to avoid those because they can lead to linking hell on WASM

ornate merlin
sturdy sequoia
sturdy sequoia
# ornate merlin Yes, direct comparison is very expensive, but it can occur only in one billion o...

While I agree, the main problem is detecting problematic collisions, let's take a simple case:

  • You have a function calls twice with two arrays
  • They cause a 64-bit collision
  • You have to either compare the arrays in their entirety (therefore needing to store all previous arguments)
  • Or you must have some kind of additional information that you can use to dissambiguate which is not necessarily trivial ๐Ÿ˜
#

For arrays, you could maybe just store the length and have a tuple (u64, usize) with the hash and the length, but do you do that for each argument?

#

It gets really tricky

ornate merlin
#

Hmm... Creating unique ID for each function and track it๐Ÿ˜‚

sturdy sequoia
#

that sounds awefully like... a hash ๐Ÿ˜„

ornate merlin
sturdy sequoia
#

we already cache per-function

#

admittedly, since we store per-function maybe we could do with 64-bit hashes

#

since functions aren't called a bagilion times

#

๐Ÿค”

#

@left night what do you think?

shy sage
sly pecan
#

@sturdy sequoia I honestly feel like anything other than a 128 bit hash might just be overcomplicating things. The raytracer is pretty much worst case scenario

#

How much time is spent hashing on a more sane document?

atomic violet
sturdy sequoia
sturdy sequoia
sturdy sequoia
#

I confused it with another hash library most likely

ornate merlin
sturdy sequoia
sturdy sequoia
ornate merlin
atomic violet
#

I feel like hashes are so fast to compute, you will lose a shit tone of cycles due to syncronization overhead and false sharing ๐Ÿ’€

atomic violet
#

hmmmm... maybe not false sharing, actually, due to COW semantics ๐Ÿค” ....

#

but it's still won't be very fast, semaphors are not free

sly pecan
#

I'm curious by the way, does x86-64-v3 improve that? @sturdy sequoia

sly pecan
sturdy sequoia
sturdy sequoia
#

I did try it and it wasn't much

atomic violet
#

SIMD in Typst ๐Ÿ‘€

sturdy sequoia
#

Swapping allocators is much more impactful

sly pecan
#

I assume you've tried two different 64 bit hashes instead of one 128 bit?

#

(yes i know it's unlikely to be faster)

atomic violet
#

I think optimizing hashing now will yield inaccurrate results in the long run because semantics of Value are about to change

#

I would rather wait a bit until API changes will stabilize, and only then try to squeeze out cycles out of hashing

#

it's not that hard to do later, doing it now will probably just complicate further design changes and more productive optimizations

#

(let alone increased compile times and binary size)

ornate merlin
sly pecan
#

That's the difference between functionally never to maaaaaybe

#

And it really is important that it never happens

sturdy sequoia
sturdy sequoia
#

The thing is that other than caching hashes, no matter the hash function, it will still show up

#

hashing over and over again will always be there

#

the difference is whether we can cache those results

#

and the answer is: yes we can

#

On a side note: Damn Obama was so charismatic, can we go back please?

sly pecan
sturdy sequoia
#

Knowing you'll be in the US too

feral imp
#

You know that guy is still alive right? And he has time? Maybe we can ask him to be spokesperson for typst? (Obama!)

sturdy sequoia
#

He recently gave a talk in Belgium for the meesly sum of 650kโ‚ฌ, I'm sure <@&1200368100202258552> can get that much cache just for him ๐Ÿ™‚

feral imp
#

pay?! are you nuts? I'll just show him the gradients+VM+raytracer, and he'll do it probono.

sturdy sequoia
#

never forget

left night
# sturdy sequoia <@311948531835469827> what do you think?

I am not convinced that the 128 bit hashes or even the hash function is a real problem. I think it's really the hashing of the same thing over and over again.

Until we do the value rework, if you want you could experiment with doing the lazy hash only on content rather than arbitrary values. We can then remove the Prehashed on flow, par, etc. and see if we get gains or not.

Jumping to conclusions and risking hash collisions in real documents seems unwise to me.

sturdy sequoia
#

I'm thinking of an Atomic<u128> that gets reset when the DerefMut impl of Packed is called

#

something like that

left night
#

you can also play with whether the hash is just for the element data or also the metadata

#

if it's the former, no invalidation is necessary for metadata changes

sturdy sequoia
#

I was first trying to set it inside Inner<T> since it should be pretty easy to detect modifications there

#

@left night is it normal that Content::with_mut doesn't check uniqueness? ๐Ÿ˜ฑ

#

There's definitely something, my initial kind of baboon-level implementation does speed things up, not as much as I was expecting, but it does make a difference

left night
glossy shore
sly pecan
sturdy sequoia
#

@left night I removed all Prehashed<Content> (I had forgor ๐Ÿ’€) and I am shocked, it really does give a nice bump in performance across the board, I think this really is the way to go for Value has well

#

It does contain the label, location, lifecycle, etc. but not the span (which tbf is small enough that we shouldn't care too much imo)

sturdy sequoia
#

So 90% of the time the hash was already stored

#

This is also the case in incremental but sometimes (depending on change) drops down to 82%, so I am wondering whether not pre-hashing the location would be beneficial

#

to avoid recursively hashing the T which is likely the most expensive anyway

left night
sturdy sequoia
left night
#

sounds good!

left night
sturdy sequoia
#

since it's simple swap and store operations, you can just do atomic memory operations on it ๐Ÿ˜‰

#

load and store *

#

I'm tired ๐Ÿ˜ด

left night
sturdy sequoia
#

And there is an AtomicU128 in core apparently ๐Ÿ˜ฑ

left night
#

looking at the crates source it sort of seems like they use a spinlock on stable rust

sly pecan
#

What do you think @left night ?

sturdy sequoia
sturdy sequoia
#

You're right, on nightly it would have 128-bits atomic

#

I mean it's still faster ๐Ÿ˜„

left night
#

I wonder whether a spinlock is more efficient than two AtomicU64

sturdy sequoia
#

AtomicU128 is actually only available on ARM64 cores ๐Ÿ˜ฑ

left night
#

Switch dynamically between a hash and the raw value?

sturdy sequoia
#

since comemo is now generic it could be done very efficiently

sly pecan
#

I think using the raw value instead of the hash

#

Together with the type

left night
#

But that will pretty much always be larger than the hash

sly pecan
#

We're talking only for integers essentially

left night
#

Note that this wouldn't apply to Typst functions that take integers. Just to Rust functions that are memoized.

sly pecan
#

Aha

left night
#

There are almost none of those that take just integers.

#

Maybe actually none.

sly pecan
#

Hopes and dreams dashed

left night
#

I was think of another approach recently though

#

A more probabilistic approach: When invoking a Typst function we measure how long it took. This does not let us skip hashing right away. But if we call it a bunch of times and it was always cheap, the next time we can skip memoization and hashing altogether. If it is more expensive in a later run, we can deopt and start memoizing again. (All of this would probably happen on the Typst level rather than the comemo level.)

sly pecan
#

Could that information be cached between compiles? It would presumably be a cheap way to improve "cold" compiles

feral imp
left night
feral imp
#

I was doing a chat gpt, when you talked about the probabilistic approach. ๐Ÿคทโ€โ™‚๏ธ

left night
sly pecan
#

I don't know. It was just a hypothetical thought

left night
sturdy sequoia
#

@left night two atomics are slower ๐Ÿ˜ฆ

#

basically goes back to main perf

left night
#

Not that surprising I guess. The spinlock wasn't contended.

sly pecan
#

I'm the guy who just throws out dumb ideas to see if anything sticks ๐Ÿ™‚

sturdy sequoia
left night
#

do you have an Atomic<Option<u128>> or what is the null state?

#

just 0? :p

sturdy sequoia
left night
#

it should be fine I guess

sturdy sequoia
#

Atomic<Option<u128>> isn't an acceptable type

#

for atomic because it doesn't impl some trait

left night
#

we are relying on no collisions after all

feral imp
left night
#

and that'd be just another collision

sturdy sequoia
#

which isn't too terrible

left night
#

true

sturdy sequoia
#

since I would expect a hash of 0 to be rare

#

like 1/2^128 levels of rare ๐Ÿ˜‚

#

expecting an "ackchually" momment from our resident mathematicians

left night
sturdy sequoia
left night
#

no like not lazy hashing location etc

#

only the element itself

sturdy sequoia
#

yes, the two atomics are just T

left night
#

the two atomics?

sturdy sequoia
#

the LazyHash approach *

left night
#

okay, so this was faster than all of Inner being wrapped in LazyHash?

sturdy sequoia
#

I'll try doing a single Atomic<u128> and then test both

left night
#

what I'm talking about should be orthogonal to one vs two atomics

sturdy sequoia
#

but that's bound to take some time since release build take 2m30s ๐Ÿ’€

left night
#

I know, it sucks

sturdy sequoia
#

because it's a really annoying bug ๐Ÿ’€

left night
sturdy sequoia
#

@left night I can confirm that a single atomic/spinlock is faster than two atomics, but whether it's just T or T + location, etc. doesn't matter much, it seems to me like in watch, T + other fields is slightly faster

ornate merlin
feral imp
#

I actually don't remember.. But it was considered deeply.

left night
sturdy sequoia
left night
sturdy sequoia
#

There you go @left night as we discussed the lazy hash thingy

lavish tree
#

I am a little bit curious whether 128bit CAS can be used to atomically update the hash without lock, which should be available in most hardware

sturdy sequoia
lavish tree
#

Shouldnโ€™t be

sturdy sequoia
#

Perhaps it would require target-cpu=native on x86 that's why it's not supported by default?

#

or it requires nightly bing_shrug

lavish tree
#

I am not so sure but I believe 128bit cas is very important (because of the ABA problem) so most recent arch should support it

lavish tree
#

I think the core lib doesnโ€™t implement this

lavish tree
sturdy sequoia
#

Indeed, lemme swap it and test it ๐Ÿ˜‰

lavish tree
#

Sweet

#

They also indicate that 128bit load and store can be supported via avx

#

So we donโ€™t need a cas

#

On the other hand, I am very curious how likely we are going to have race condition?

sturdy sequoia
#

That is indeed quite a bit faster

sturdy sequoia
#

but we will eventually

lavish tree
#

Oh nice to hear

sturdy sequoia
#

we already parallelized comemo fully to make this possible in the future

#

but we have (so far) decided that in such cases, racing is okay, at worst you're just doing the same compuation twice, and it's likely that the cost of additional sync would be higher than the gains from it

lavish tree
#

I was a little bit curious do we need really need atomic or not๐Ÿค”

#

Yah I would like to see multithreaded and your work there really interest me!

sturdy sequoia
lavish tree
sturdy sequoia
sturdy sequoia
lavish tree
#

Very nice

sturdy sequoia
#

that's the trick, it saves tons of computation on nested structures (which happen a lot in typst)

lavish tree
sturdy sequoia
#

and therefore the hash having been reset

lavish tree
#

Oh it can be reset?

#

Then never mind it would make things much harder

#

If the hash is only computed once and never changed then I think it might be useful to consider that

sturdy sequoia
#

now the idea is to still hash on-demand, but store the hash anyway

lavish tree
#

No I mean once it is stored why changed?

sturdy sequoia
#

I haven't tested, but I wonder how a RwLock<u128> performs too ๐Ÿค”

lavish tree
#

Might be very expensive

sturdy sequoia
#

like a Content to assign its Location, stuff like that

lavish tree
#

Ah ok

#

I thought types avoid mutable stuff to make things easier

lavish tree
#

Well wait if a content is changing then atomic hash is not gonna help?

#

We need to hold the things with a lock (maybe rwlock) to protect the potential that hash doesnโ€™t match the content?

sturdy sequoia
lavish tree
#

Yes but if multithreaded

sturdy sequoia
#

the problem is that when Hash::hash is called you don't have a mutable reference

#

therefore you need to use interior mutability

lavish tree
#

Another thread can still call it and compute the hash even after reset but before the content change available to it

sturdy sequoia
#

in addition, the value might have already been modified and it just needs to be hashed on another thread

sturdy sequoia
#

which means if you can mutate it, you can't also have a reference to the hash inside of it

lavish tree
#

Ah ok

#

I didnโ€™t really thought in the way rust allows

#

Thatโ€™s interesting

#

Then I donโ€™t really think we need a atomic128

sturdy sequoia
lavish tree
#

I guess they implement some very lightweight rwlock

sturdy sequoia
#

And doesn't introduce an additional dependency which I like

sturdy sequoia
lavish tree
#

What would be the performance of a plain u128? As we donโ€™t have multhrrad yet

#

Haha of course

sturdy sequoia
lavish tree
#

Well a little be unsafe?

lavish tree
#

Just for testing

sturdy sequoia
lavish tree
#

Would it be helpful if only one thread is computing the hash and other just wait? But probably computing the hash is fast enough that cost of synchronization is higher

#

But we canโ€™t benchmark without a real multithreaded model๐Ÿซจ

#

Maybe we can gain a lot as this is definitely a hot path๐Ÿค”

sturdy sequoia
#

Well it doesn't even compile using unsafe ๐Ÿ˜‚

#

or I'd need to use an UnsafeCell

lavish tree
#

Yes

#

Too bad syncunsafecell is not available in stable

#

You would also need to unsafe impl Sync and Send I suppose๐Ÿ˜‚

lavish tree
#

The rwlock is just as cheap as an u64 atomic read/write๐Ÿ˜‚

#

Even mutex should perform the same I believe

sturdy sequoia
#

Using a raw u128 in an UnsafeCell basically doesn't gain anything ๐Ÿคทโ€โ™‚๏ธ

lavish tree
#

Thanks for testing hh maybe atomic is cheaper than I thought

sturdy sequoia
lavish tree
#

I am not expert in arch in which I should๐Ÿ˜ช

#

Yeah should be

#

Especially without contention

sturdy sequoia
lavish tree
#

Shouldnโ€™t the atomic crate use a very lightweight lock

#

Maybe just some error on the benchmark

sturdy sequoia
#

This is likely very microarchitecture dependant

lavish tree
#

Probably

sturdy sequoia
#

I know that my other machine with a 12900k performs very differently in typst mostly due to lower memory latency than my main machine (7950x3d)

lavish tree
#

The atomic crate cache a bunch of spin lock

#

Well no idea๐Ÿ˜ช

sturdy sequoia
lavish tree
#

I have no idea how parking lot can be faster

sturdy sequoia
lavish tree
#

True๐Ÿ˜‚

sturdy sequoia
#

I updated the PR ๐Ÿ˜‰

lavish tree
#

I see they have a dark magic of using the elision but I have no idea about that

#

Sweet

sturdy sequoia
#

Anyway, I'm off to bed it's 2AM here ๐Ÿ˜ญ

lavish tree
#

Oh good night

lavish tree
#

I found it, absolutely crazily optimized

#

that should says that parking_lot is probably as fast as an u128

tight glade
sturdy sequoia
#

And it was removed in comet lake from every sku sad

sturdy sequoia
lavish tree
lavish tree
sly pecan
lavish tree
#

I think that arm also have such extension

atomic violet
#

From what my prof told me, transactional memory is a miracle everyone wants, but no one can implement without introducing terrible SPECTRE-like vulnerabilities. He does not know whether it's just an implementation issue or a deeply-rooted problem due to safe transactional memory is just impossible conceptually. But at least that's the reason why both AMD and Intel abandoned it.

lavish tree
#

ah good to learn, are you an arch student?

atomic violet
#

just comp sci with an arch course

lavish tree
lavish tree
feral imp
lavish tree
#

oh man OS is fun!

feral imp
#

Obviously off-topic, so we can talk about that in #off-topic. Sorry, I did it again..

lavish tree
#

oops sorry yes

sturdy sequoia
#

not on x86 mind you

#

on their POWER architecture

sly pecan
#

does IBM even make anything other than mainframes at this point?

sturdy sequoia
#

Don't talk down to my darling

#

one day I'll have one of their CPUs to test

atomic violet
#

tbf transational memory is good if you trust the code you are running

feral imp
#

POWER arch wasn't that Mac in the 00s?

sly pecan
#

yes

sly pecan
sturdy sequoia
sly pecan
#

Those are mainframes though, IN YOUR FACE

#

๐Ÿ™‚

sly pecan
#

Anyway, it wasn't meant in a negative way

sturdy sequoia
sly pecan
sturdy sequoia
glad urchin
# sly pecan

I told you not to download the transactional memory CPU extensions

#

Anyway, the damage is done now... It cannot be stopped

feral imp
#

So @sturdy sequoia, what's your immediate intentions with the beloved VM?

sturdy sequoia
#

then I think it's pretty much ready

#

not merge ready, but the perf is in a good place!

glad urchin
#

Will 10000 tables compile in 0 seconds now?

#

(Note: 0.1ms is not acceptable)

sturdy sequoia
glad urchin
#

Also I may have made a confusion in #discussions

#

I think once I said 10000 tables of 5000 cells compiled in 1 second

#

But that's actually a separate benchmark

#

I think it was like 400 tables or so

#

lol

#

10000 i think takes 30s or so

#

:p

#

But yeah after your PR I'm sure we're getting to 1s ๐Ÿ‘

#

Trust

sturdy sequoia
sly pecan
sturdy sequoia
#

รจ_รฉ

sturdy sequoia
atomic violet
sturdy sequoia
atomic violet
#

I have a really dumb optimization idea: every time document recompiles, choose like 80% of the functions and don't memoize them

sturdy sequoia
#

I think having an estimate of the "size" of the input and the "complexity" of the function would be smarter and clearly would just workโ„ข๏ธ

atomic violet
#

motivation is that

  1. 20% will still memoize something, and maybe something worth memoizing
  2. If a call worth memoization, it is expected to get memoized in 5 interations anyway
  3. Who cares about the first few recompilations anyway, when typst-preview does than 3 times a second?
  4. It decreases memory usage and potentially causes less hashing
feral imp
#

I think having function attribute in typst to opt out of memo is a powerful concept.

sturdy sequoia
#

although I wouldn't mind some of whathever he's smoking ๐Ÿ˜‚

atomic violet
sturdy sequoia
feral imp
#

Heuristics heuristics are truly powerful concept.

#

Basically, I'm implemented by a collection of heuristics on top of each other.

sturdy sequoia
#

I am thinking of a simple threshold based system, like if it takes over say 100ยตs, it gets memoized

#

with that threshold being adjustable of course

#

I think this could greatly reduce memory size and reducing hashing, what do you think @left night

#

Guess it wouldn't reduce hashing actually

#

since you'd still need to check whether it's in the cache

#

So it's not worth it

feral imp
#

Can you count instructions?

sturdy sequoia
feral imp
#

Either would do for determining memo potential?

#

๐Ÿค”

low sapphire
#

well I mostly used it as a desktop :^)

sturdy sequoia
#

there is no good way, we must memoize from the get go

untold turret
#

And always memorize recursive functions

sturdy sequoia
ornate merlin
left night
glossy shore
atomic violet
#

well it's going to be 97% in 15

#

not 100%, sure, but you don't need 100% either

glossy shore
#

unless you like, systematically do 20% of the total each recomp and never choose the same one twice

sturdy sequoia
#

But I might give it a whir in the VM

sturdy sequoia
left night
#

@sturdy sequoia PR looks nice, just a few comments, in particular about the unsafe, which I think isn't necessary. I think once merged I will remove Prehashed from comemo. It is too opinionated for it, in particular also the PartialEq and Eq implementations.

sturdy sequoia
#

@left night regarding that PR, I was wondering something, it's now trivial to make Array and Dict use a LazyHash (such that hashes are conserved), should we do that? ๐Ÿค”

quaint blaze
#

so what cursed performance shenanigans have been done since last release?

left night
feral imp
atomic violet
#

I am afraid doing threaded interpreting may cover the code in the hell of unsafe, and actually won't be as effective because typst is simply not as well optimized as python, but it does not mean it should be disregarded entirely

sly pecan
#

not sure if it's related

atomic violet
#

"threaded code" is unrelated to multithreading, it is a way of writing switch statements

atomic violet
#

very unlikely though, llvm probably saw the first bound check in the interpreter loop and went "uhm, yeah, I don't know what the hell the dumb human wrote here... yeah screw optimization, gotta go recompile that regex crate instead" ๐Ÿ˜‚

sturdy sequoia
atomic violet
#

seriously though, idk, compilers are a hard guys to negotiate with

sturdy sequoia
#

||O||

sturdy sequoia
#

ping

#

||what could this mean? ๐Ÿ˜Ž||

sly pecan
#

what

untold turret
#

performance time ๐Ÿš€

sly pecan
#

#contributors message maybe this belongs here

feral imp
#

Isn't it a big accuracy concern more than speed (but also speed)

sly pecan
sturdy sequoia
#

I would argue the int vs float one likely makes sense, the others? probably not so much, would need to see how libc does it

left night
sturdy sequoia
cunning wadi
#

This design might work quite well for typst

feral imp
#

@sturdy sequoia VM Part II?

#

Subtitle: Stackless is Thankless

silk wedge
#

not sure how useful being able to continue a piece of computation in Typst code would be, though

low sapphire
sturdy sequoia
#

it just has an infinite number of registers

#

I mean to work on it but between work and now this week which sucked ass (one of my doggo went to the great park in the sky ๐Ÿ˜ข ) it's been difficult to find the motivation and time to work on Typst ๐Ÿ˜ฆ

lunar kettle
#

Are u in the US now btw?

#

(Sorry kinda off topic ๐Ÿ˜‚)

sturdy sequoia
#

My boss is awesome, they're financing my PhD, and they're paying me really well (for a PhD candidate)

#

and I get to work from home 4 days a week ๐Ÿ˜Ž

silk wedge
#

I wish I had a job remotely as good as that

#

Iโ€™m comically introverted so ๐Ÿ˜ฆ

left night
#

@sturdy sequoia I feel a little bad asking this, but since the bytecode VM work seems very stalled, should we perhaps close the PR for the time being? We can always reopen it in the future. I'd just like to keep the PR tracker from growing and growing.

sturdy sequoia
#

I'll work on it at some point, but motivation has been tough lately

sturdy sequoia
#

You know what, I am doing it tonight, I need to work for 1-2 hours, but after that, it's VM time โค๏ธ

#

Won't be done tonight

#

but I'll try and merge with upstream

#

(the existing code I meant)

feral imp
#

How did it go?

sturdy sequoia
#

I'll see if I can finish tonight after work

#

I walked the 20kms of Brussels yesterday, my feet are โœจ destroyed โœจ

#

But, I had a significantly easier time than last year, so that's a good sign ๐Ÿ˜„

feral imp
sturdy sequoia
sturdy sequoia
sturdy sequoia
#

@left night I am thinking of splitting the VM into its own crate (i.e typst-vm), but this creates the issue of having external components, etc. The way I am thinking of doing it is by having a Vm<T: VmImpl> where VmImpl allows the creation of all the relevant values (a function to create a SpaceElem, etc.). Because adding all of the (fairly long) VM code into typst feels a bit ugly, what do you think?

#

Actually that really doesn't work ๐Ÿ˜ฆ

left night
#

I think only typst-vm depending on typst is at all feasible since the VM will need access to many of the central data structures. I have thought a bit about how to shatter the crates in the future, and my conclusion was that we need most of the type definitions in a core crate, but could extract computations into separate crates (rustc works similarly) which are dynamically "linked" to together through a table of function pointers produced by one thin top-level crate that depends on the core and all computation crates. But I don't think it's necessary to do this for the VM now, we can do it for everything at once sometime in the future.

sturdy sequoia
#

I was thinking on typst depending on the VM, admittedly it could be the other way arround, but I'll keep doing what I am currently doing: shortening the heck out of the VM, I am trying to make it short enough that it's not 10k lines long

sturdy sequoia
#

@left night I have changed the logic for import from dynamic values (i.e paths which are not known at compile time): I am now allowing it (which I think it's cool since I saw a package the other day rely on it) but for it to work you must specify the imports manually or use the name of the import (i.e if I import codly, I need to use codly)

#

I think it's a good tradeoff, if the import can be done at compile time: then it is done fully at compile time

#

Complexity wise it's really quite okay

left night
sturdy sequoia
#

everything else should work as before hopefully

untold turret
glad urchin
#

but not 100% sure

sturdy sequoia
#

because I can get cetz and .draw at compile time, I have a specific case for these kinds of uses!

#

It really only is what @glad urchin showed: completely dynamic imports

#

note that this would also not work:

let cetz2 = cetz

import cetz2.draw
#

but this limitation could be lifted with a const evaluation and inlining

#

Actually it would be fairly easy to remove this limitation, I may do it

#

Just not just yet ๐Ÿ˜„

sturdy sequoia
#

Ok, I have implemented it, it's dead easy with the architecture

feral imp
#

We are soooo back baby!!

sturdy sequoia
#

It's also quite a bit shorter

#

and it is quite a bit more powerful with dynamic imports (I know some packages rely on them) and automatic constant inlining which imo is pretty darn cool ๐Ÿ˜Ž

#

(I should note that the constant inlining it almost free)

sturdy sequoia
#

BTW I should mention that the new compiler produces much better data structures which may improve performance quite a bit by using linear structures instead of tree-like structures, so I am hopeful it will be better ๐Ÿคž

sturdy sequoia
#

mostly simplifications

sly pecan
#

We really need some sort of CI for performance

sly pecan
#

Would've detected the performance regression in the 0.11 release candidate for instance

feral imp
feral imp
#

I think that memory consumption should be a bit higher priority than compile/inc-compile performance, so I was a lil slick with that relocation of the goal post...

untold turret
#

Besides, we need a nice set of documents for benchmark ๐Ÿ˜‰

feral imp
#

And incremental-compilation isn't that easy to test either...

feral imp
untold turret
sly pecan
#

It was fixed, but I still think 0.11 is a bit slower in some areas

silk wedge
#

we might not need to have both the AST and bytecode present at once

#

in fact, we could use an arena for the AST and/or avoid creating an explicit AST altogether

sturdy sequoia
sly pecan
sturdy sequoia
#

Do I need to git blame and spank someone? angryeyes