#Performance
1 messages Β· Page 4 of 1
That's a great improvement already
I'd go further and give a name to the Register | Local parameter
e.g. output=...
I guess
btw what do the field types become in the end?
the struct wouldn't exists as is π
it would be more like
struct Add<'a>(&'a mut Executor);
hard to say. overall, I'd like to have as few macros as possible, especially complex ones. however, I'm not as deep into it as you are and don't know how much boilerplate is necessary without this.
It would also produce:
struct AddBuilder<'a>(&'a mut Compiler);
impl<'a> AddBuilder<'a> {
pub fn lhs(mut self, lhs: impl Into<LhsArg>) -> Self {
...
}
}
enum LhsArg {
Const(Value),
Register(Register),
...
}
impl Into<...> for LhsArg { ... }
So allowing me to do the following:
impl Compile for ast::Add<'_> {
fn compile(&self, compiler: &Compiler) {
let lhs = self.lhs.compile(compiler)?;
let rhs = self.rhs.compile(compiler)?;
compiler.add().lhs(lhs).rhs(rhs)
}
}
There's unfortunately a lot of boilerplate atm
unfortunately I cannot easily create types with macro_rules π
at least not things like LhsArg, etc.
You'd declare them explicitly
is the code anywhere public?
do other VMs written in Rust use macros for similar things? and either way (yes/no), could we take inspiration from them?
not that I've seen but they're usually fairly simple
what's really causing me trouble here are the joining and weird return, continue, and break semantics
Those are a pain point atm
Which leads to often complex code, other things such as short-circuiting for and and or are also a bit painful π
Another way I can handle some of the complexity is by having dedicated structs for things like loops that would look like:
struct LoopExecutor<'a> {
parent: &'a Executor,
iterator: Box<dyn Iterator<Item = Value>>,
locals: SmallVec<[Value; 4]>,
instructions: &'a [Instruction],
}
And then handle all of the complexity in a nicer way overall
I might just do that
Woud likely be slower mind you
But maybe working with the type-system is nicer?
A single executor would bring more compact code. for the dyn iterator, it could be replaced by a slice or enum.
I think a block abstraction is needed. π€ I note you have both iterator and instructions, where the iterator is not belong to a block statically. I think iterator is the input of a block
struct Block {
iterator: smallvec::Vec,
}
Also let it no box and no dyn.
A serious question about performance. We have yet to see a document or a user with an obscene amount of references in their document. What aspect has an influence on this? Eval? Comemo? Something else?
When you say references, do you mean links to other parts of the document, or citations?
Citations! Sorry π
label query was ever optimized by dherse, but should have room to get further improvement.
Everything can always be improved. I was thinking if this was related to it. This being the VM work. I'm not requesting anything here.
From my knowledge, introspection or introspector handles queries, and this part isn't quite incremental. https://github.com/typst/typst/tree/1612913f8f195248059156a7ae1a08a31c7f5016/crates/typst/src/introspection
The VM wonβt improve query performance outside of any filtering that you might be doing in your code, I am actually very satisfied with the performance of introspection at this point
Good! That's what I just wanted to hear. π
I think there might be ways in the future to improve it further, but not without adding quite a bit of complexity, too much to justify it right now when eval has become such a big bottleneck
To be clear I'm not shitting on eval, it didn't use to be the bottleneck, it has become (at least one of) the bottleneck thanks to the content rework and various other optimizations such as Deferred<T>
And I think that --timings has highlighted that it's slower than we thought
Would hover tooltips work with the VM?
For reference: Hover tooltips are powered by these lines of code.
oh cool
it should, right now it wouldn't work (because I just haven't setup the tracer)
But my goal is pretty much feature parity
The big difference is that I might build a table for span -> instruction to make inspection not require a full recompile
but that's less clear to me yet
why would it require a full recompile?
and wdym with recompile? bytecode recompile?
well if the introspection is done at compile time (when calling Compile for ast::Expr) then it would require recompiling
the bytecode yes
okay
So I'll probably build a table or some kind of smart structure to handle that
Probably just HashMap<Span, usize>
Does it change existing code? If not, it it probably not a recompile.
it's not that it would change existing code
it's that it's easiest to do in the Compile for ast::Expr implementation
but doing it there would force recompiling the bytecode everytime the tracer span changes
And doing it in the VM will require some kind of smarts
Performance was mentioned somewhere else but here: #1199240084336164915 message
Now that I have done my exam
It's time to get back to the VM
by...
completely re-writing it π
Why are you doing this to yourself
I don't know π
What do y'all think of this one:
opcodes! {
Nop = 0x0000,
Add -> Writeable => {
/// The left-hand side of the addition.
lhs: Readable,
/// The right-hand side of the addition.
rhs: Readable,
} = 0x0001,
}
It only create the "readable" instructions, not the builders for the compiler, only the readers for the VM
I wonder, how can you do this to yourself?
I want to learn this power.
It expands into a big enum from which opcodes are created and small structs, here is an example:
// Recursive expansion of opcodes! macro
// ======================================
#[doc = r" No operation."]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Pod, Zeroable)]
#[repr(C)]
pub struct Nop {}
#[doc = r" Adds two values together."]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Pod, Zeroable)]
#[repr(C)]
pub struct Add {
#[doc = r" The left-hand side of the addition."]
pub lhs: Readable,
#[doc = r" The right-hand side of the addition."]
pub rhs: Readable,
#[doc = "The output of the instruction."]
pub out: Writeable,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Opcode {
#[doc = r" No operation."]
Nop,
#[doc = r" Adds two values together."]
Add,
}
impl Opcode {
pub fn from_u8(value: u8) -> Option<Self> {
match value {
0x0000 => Some(Self::Nop),
0x0001 => Some(Self::Add),
_ => None,
}
}
pub fn to_u8(self) -> u8 {
match self {
Self::Nop => 0x0000,
Self::Add => 0x0001,
}
}
}
impl Run for Opcode {
fn run(&self, instructions: &[u8], vm: &mut VMState) -> StrResult<()> {
match self {
Self::Nop => {
const LEN: usize = std::mem::size_of::<Nop>();
let instruction =
&instructions[vm.instruction_pointer..vm.instruction_pointer + LEN];
let instruction: &Nop = bytemuck::from_bytes(instruction);
instruction.run(instructions, vm)?;
vm.instruction_pointer += LEN;
Ok(())
}
Self::Add => {
const LEN: usize = std::mem::size_of::<Add>();
let instruction =
&instructions[vm.instruction_pointer..vm.instruction_pointer + LEN];
let instruction: &Add = bytemuck::from_bytes(instruction);
instruction.run(instructions, vm)?;
vm.instruction_pointer += LEN;
Ok(())
}
}
}
}
And then I manually impl the Run for each opcode:
impl Run for Add {
fn run(
&self,
_: &[u8],
vm: &mut VMState
) -> StrResult<()> {
let lhs = vm.read(self.lhs)?;
let rhs = vm.read(self.rhs)?;
vm.write_one(self.out, ops::add(lhs, rhs)?)?;
Ok(())
}
}
I basically designed it to have variable size instructions, and insanely good cache efficiency
or at least as good as I can get it!
And it's also as functional as I can, using methods wherever possible to hide complexity
Well this new macro convinces me! π₯
β€οΈ
The whole point is to make writing instructions as easy as possible
I do see the point but also thankfully it doesn't do everything, i also think it's quite clear!
especially in the future when other people are writing them
Exactly π
I do wonder what's the usecase for an instruction's run impl being aware of the rest of the instructions?
I plan on making any kind of calls or scopes be different VMs (VMs are basically free to create)
That way, I can make some of the more complex semantics of Typst easy to build
such as join on continue, join on break, etc.
And because VMs are basically zero cost (zero allocation, zero computation), all they need are the instruction list
So call function_value would fetch the function's value, which would then also need instructions to execute?
no, a call would call a completely different set of instructions
but for example enter (to enter a new scope) would have the length of instructions it must take with it
It's really just to compensate some weird semantics around return, break, and continue
I really hope there's a better solution
I don't think there is, dealing with joins any other way is super error prone in my attempts
I just don't like statically compiling only to use dynamic things
I understand that, but the joining semantic is what it is and removing it would be pretty bad π
BTW @glossy shore as I am writing the code for scoping, although I'm struggling a bit with lifetimes, it does lead to a very simple system which is easy to understand π
@left night how open are you with a tiny itsy bitsy bit of unsafe? π
I swear it's the first time I've ever struggled with lifetimes this much
I was gonna ask whether it's Rust lifetimes or Typst lifetimes
hold-my-beer
you can try using refcell and if it doesn't panic maybe it's fine
I mean I have to find a solution, but this will work in the meantime π
I'm so curious to see what it looks like!
Right now, a big advantage of the macros and the traits I have made is that it's significantly shorter
The new VM is almost fully implemented
and boi oh boi was it easier to implement
Even the more complex stuff just goes a lot more smoothly
Does it go vroom vroom though?
nice!
fair question π
Your typst documents will compile so fast your fans won't have time to go vroom vroom π¦
I suggest a little bit of code at the end of typst's main function ```rs
loop {
std::thread::spawn(|| {
let mut x = 0.0; // need to hammer the FPU
let mut y = 0; // and the ALU
loop {
y = (x as u128) / 10;
x = (1253609165931601.5 / y.sqrt()) as u128;
}
});
}
TODO: write stuff to disk, use GPU, heat up RAM
I notice that this bit of code will simply produce NaNs
There is only one major thing missing: break, continue, and return
but thankfully it's relatively easy to do
What about goto?
it's all easy to do π
Now laptops double as electric heaters.
Fun fact, on a laptop, you might run out of bandwidth in the integrated NoC or ring-bus/whathever in your CPU before you manage to load everything all at once π
On a Desktop too BTW, with the first gen Epyc server CPUs, you could easily overload the infinity fabric and cause bottlenecks when using all of the PCIe lanes at once π
So.. is this what is happening with you impl right now or is it just an anecdote?
crashing the CPU would be a bit much for typst as a risk π
just an anecdote, overloading your CPU's bus is... hard
especially on a single thread π
Ok, only one thing missing and then I can test: including other files
Oops, I forgot I also need a compiler π
Ok, the VM is done (but untested) now it's time for the new compiler
But I'll probably take a long break now
Break good, it is Saturday β€οΈ
Sunday*
It's Sunday already???πππ
ikr
Now I write the new compiler π
I thought long break meant a few weeks
Dherse takes a nap and wakes up with Keanu Reeves on top of him saying "Wake the fuck up Vinaigrette. We have a compiler to write" and the heavy metal starts playing
I've seen it...
I was actually taking a spa π
The new compiler is π₯ mind you
it's not done (obviously)
but it's taking shape and is much much nicer
This is the new functional instruction approach
I think it's overall nicer than the old one
I'm so excited π
Accurate π
Looks pretty!
@sturdy sequoia did you miss an optimization? https://github.com/typst/typst/pull/3297
π±
this is a clone of Expr<'a>, which is possible very very cheap. It copies 24 bytes data for each "loop in typst", and make your CPU 0.0001% hotter.
I think it is a simple refactor than a performance fix.π±
Great perfectionism to performance.
couldn't this be abstracted away just a bit with a self.instruction etc.?
Probably
I wanna go over your changes, but it sounds daunting
I'll be publishing the new compiler soon
it's not tested yet
but the architecture is pretty much done imo
@left night is there a way to turn a Source into a Span?
(it's for timing annotations)
Other than doing source.root().span() that is
No, that wouldn't make sense. A span always identifies a syntax node.
For those keeping tab, that's about 10k lines total now that both the compiler and VM are implemented (still untested mind you)
Would still be epic if you get 4x compilation time gain. I'm very excited.
But maybe that's the wrong attitude.. Because this is meant to make typst language faster, right? Like just overall.
The new VM can already do more than the old one 
π For example?
The old one being the one I was working on last week π
It has perfect handling of show set rules, of joining, etc. already
have you rebased?
π
not yet π
You can find the latest code here
WARNING: THIS IS STILL VERY BUGGY
AND STILL PANICS A LOT
Ok, loops work correctly, so do conditions, making good progress
unfortunately I am out of time for this week π¦
i dont believe you
@left night I don't think the following makes sense:
#let identity(x) = x
#let out = for i in range(5) {
"A"
identity({
"B"
break
})
"C"
}
#test(out, "AB")
Essentially we are breaking somewhere that is supposed to be an argument, which leads my VM to fail this test. But I think it has the correct behaviour here by not joining the "B" with the "A". This is because the VM is the block as being preparation for an argument and doesn't join it on exit π
I could emulate this behaviour, but I think this use case should be rare enough to not be an issue?
it will join a and b and produce "AB"
I added the assertion
wait
with the current behaviour?
I missed the identity
lol
thought it was some random function and it was joining with the thing inside the block
if it was just a block, not an argument, it would work btw
well currently it kinda does π
but not in my VM
it's joining with the output of identity
if i understood correctly
it breaks as soon as identity finishes
yes because control flow can go "through" a function
which I really don't think should happen
then i think we have tobe careful here
cuz that seems to be pretty fundamental
other larger cases could subtly break
true
But I still find this "transitive control flow" behaviour weird as heck
I can emulate it, don't get me wrong
but I kinda don't want to π
It feels so wrong
i agree it's weird but there's probably some use case we're missing
it's not always obvious when the underlying thing in the compiler is being used, just by looking at the tests
i guess you'd have to run them equipped with a debugger
-------------------------------
Enter { span: Span(925904528655), len: 327, scope: O0, flags: 6, out: Some(Some(R0)) } += 18
- Enter(327)
Copy { span: Span(4166570378940), value: none, out: J } += 13
Instantiate { span: Span(6712807832736), closure: F0, out: R0 } += 26
Copy { span: Span(31943706238530), value: none, out: J } += 39
Args { span: Span(41723572822423), capacity: 1, out: R3 } += 54
PushArg { span: Span(41955048954585), value: C0, out: R3 } += 67
Call { span: Span(41318489591138), closure: A1, args: R3, flags: 0, out: R2 } += 83
Iter { span: Span(41318489591138), scope: O1, len: 152, iterable: R2, flags: 1, out: Some(Some(R1)) } += 103
- Enter(152)
JumpLabel(0) += 0
0 => Enter: Enter { len: 327, scope: O0, flags: 6, out: Some(Some(R0)) } <= 18
0 => Copy: Copy { value: none, out: J } <= 13
13 => Instantiate: Instantiate { closure: F0, out: R0 } <= 13
26 => Copy: Copy { value: none, out: J } <= 13
39 => Args: Args { capacity: 1, out: R3 } <= 15
54 => PushArg: PushArg { value: C0, out: R3 } <= 13
67 => Call: Call { closure: A1, args: R3, flags: 0, out: R2 } <= 16
83 => Iter: Iter { scope: O1, len: 152, iterable: R2, flags: 1, out: Some(Some(R1)) } <= 20
0 => Next: Next { out: R0 } <= 11
11 => Enter: Enter { len: 110, scope: O2, flags: 2, out: Some(Some(J)) } <= 18
0 => Copy: Copy { value: S0, out: J } <= 13
13 => Args: Args { capacity: 1, out: R0 } <= 15
28 => Enter: Enter { len: 22, scope: O3, flags: 2, out: Some(Some(R1)) } <= 18
0 => Copy: Copy { value: S1, out: J } <= 13
13 => Break: Break <= 9
22 => Exit: "B" + 22
- Writing to R1
110 => Exit: "A" + 110
- Writing to J
152 => Exit: "A" + 152
255 => Copy: Copy { value: C1, out: J } <= 13
268 => Args: Args { capacity: 1, out: R2 } <= 15
283 => Eq: Eq { lhs: R1, rhs: S3, out: R3 } <= 15
298 => PushArg: PushArg { value: R3, out: R2 } <= 13
311 => Call: Call { closure: A2, args: R2, flags: 0, out: J } <= 16
Oh I do
trust me
I do have a debugger
i meant like in the old code
ah right
to figure out where the transitive break happens
I know how it happens, I just hate it π
though being able to debug new eval is also cool
It feels wrong
I mean the output is cursed but it helps me debug the weird edge cases
cant wait to have the equivalent of python's dump thing
π€ ?
you can extract a function's bytecode within python iirc
ah right
I mean the bytecode is just a big bunch of bytes
>>> dis.dis(myfunc)
2 0 RESUME 0
3 2 LOAD_GLOBAL 1 (NULL + len)
12 LOAD_FAST 0 (alist)
14 CALL 1
22 RETURN_VALUE
cant wait to do the same in Typst
π
π
gn
Itβs essentially the same thing we discussed before. Itβs breaking the inner block, but then still finishing up the expression. The semantics are fairly simple conceptually: Instead of βleave everything right nowβ it is βleave every code/content block early, before starting to evaluate the next expression in it.β
Yes but in this specific case it feels weird to me that it goes through a function call, Iβll try to make that work when I have time to work on it again. Thankfully my new implementation should be able to do it more easily but it wonβt be pretty
We did have that in previous examples, too, where it called the text function, just like it is calling identity here.
Hmm
I donβt know exactly how Iβll handle that but itβll be tricky
Ideally it would arise naturally from you only checking for control flow in between the steps of a block, not in between arbitrary expressions
Same as in (break, 1, 2) which can join with other existing arrays.
Should both fn(break) and (break,) be rejected by bytecode compiler?
π what
π
ok, I seem understand what it says. But I don't know what I can do by this semantics.
It ensures that things join & style properly when having a break nested in code/content blocks with styling in between
We could consider allowing control flow keywords only directly in blocks (like set/show currently are). That wouldn't alleviate the need for the semantics (since you can put a block anywhere), but it would reject more code that is likely wrong.
I try to compare typst with rust, as they both regard blocks as expression (so infers that if, while, for are expression). But I soonly find that typst has join semantics, then we cannot directly borrow benign and undefined behaviors from rust. To leave room to current compiler optimization or that in future, we may reject some strange grammar or mark them as no semantic ensurenesss.
π€ break and continue may should never be expressions but special statements.
I think I "re-"understand it. during execution, we may have a sequence of unfinished expressions on the stack. The semantics of break or continue will try to goto the next position, which will finalize the expressions stack in order. See:
#x(for {
y(while {
z({
u(1 + v(2; 3, break))
})
})
})
There will be a stack of expressions which are waiting for incoming arguments. In the example, the stack is [x, y, z, u, v]. And a break in the expression will finalize v, u, z in order.
Yes that was my initial approach, to add a Breakpoint opcode on where control flow is processed, or just do it always when coming out of a block
My problem is that break isn't a value so it really makes my life terrible
This usecase makes sense to me, but the VM doesn't support it
My problem is that as far as the VM is, an argument is evaluate before the call but "inline" (i.e in the same scope)
so I'll need to add Breakpoints to know when I can safely be done executing, but I don't know how well that'll work
I guess that breakpoints can be added between each expr in a loop?
What do you think @left night
so it would look like
#while true {
text(blue)[
My text is blue.
#break
And not green.
]
breakpoint() // not a real function, just to show the opcode
}
Control flow needs to be processed not only between loop iterations, but also between each individual expression in a block though. Otherwise "And not green" shows up.
so a break does the following:
- In the loop: stop the loop at the next
breakpoint - In a block in the loop: stop the current block, carry the control flow?
I assume somewhere in your code, you join the value of each expression in a block? That's where you also need to check for a flow.
not really because my code actually has a special purpose register which is the "join" register
as far as the VM is concerned, there is no "joining", it just writes to that register
and that register has no access to the other state
With my approach it'll work but it does remove the ability to use a break as a value
so you can break in a scope ([] or {} but not as a regular expression)
which imo is fine
We coul always make break optionally take a value like return?
okay, so this basically? #1176509648707256370 message
@sturdy sequoia I have read your opcodes.rs and opcodes_raw.rs, there is question: the possibility to eliminate element instructions in opcodes_raw.rs by func calls.
π€ why do some elements are in but others are not in the opcode table for construction? E.g. creating math.frac is an inst but creating raw is not.
yes
the markup-based elems are compiled to special instructions
they're not needed per-se, but I think it's nice that it's a single instruction instead of a bunch of them
does it bring benefit by compiling to special instructions?
it should theoretically be faster at the cost of bigger code size
I'd also argue it's easier to maintain
but it's neither here nor there
but would be fine imo
I'm thinking about possibility to split a typst-vm or typst-evaluator, which creates packed content by calling functions. From that perspective, it is better that don't specialize element constructions.
Isn't math.frac also markup-based in that sense?
@left night the idea of adding special flow instruction (previously called breakpoint) works π
The VM will necessarily depend on the built-in types (e.g. Array and Dictionary) and those live in the same crate as the elements. What you propose is similar to the old LangItems setup when typst and typst-library were split. I think in some future we can think about splitting crates more again (differently than last time though), but I wouldn't bother with this here since the old evaluator does it the same way.
@left night it even works with break as a value with one caveat: it counts as a none when placed in an array:
#let x = for i in range(1, 5) {
if i == 3 {
(1, 2, break)
} else {
(1, 2, 3)
}
}
#x
leads to:
is this acceptable?
I can probably fix that, but it would be... really hard
I was witnessing typst gets merged with typst-library as they usually references each other.
It will definitely be splitted again in future, since typst crate is too big to a sensible scale of crate. We should find a good point whatever. And I consider an evaluator/vm.
Was typst a purely evaluator/vm?
That would also need some major changes to the compiler as I have written it with typst in mind
I was going to suggest doing a similar split to allow tools like manim to be built using the typst language (and most of the features) but with custom syntax, etc.
how does it look like? a reference link or a simple example? I have not used manim but I have heard that.
Essentially it allows you to build animated slides with Python, which is used a lot by math content creator and the like, I think Typst is uniquely well suited for such a tool, but it would likely require some custom syntax, and a lot of additions to the current global scope
@left night I think it's quite interesting that because destructuring is compiled ahead of time, things like this are already supported without me even thinking about it: https://github.com/typst/typst/pull/3308
That's how it is intended to work
Awesome π
Then it... just worksβ’οΈ
If I want to compare how my stuff compares to main/release without switching branches and stuff, I just open the web app π
I do too sometimes, but I should be studying
and I am too lazy to procrastinate any more π
I only sort of agree. I would also like to split it again in the future (just in a better way), but there is also a trend in Rust that some large projects that were previously split up into many small crates go back to fewer larger crates. One example is tokio where the core crate is larger than typst.
I think perhaps making typst more "extendable" with the ability to have more dynamic stuff, and operator overloading (i.e defining addition for custom values, etc.) it could also work just fine
custom join too etc.
that way typst as a crate is more extendable
yes, but I think it would be close to impossible to make eval not depend on some base types
and I mean operator overloading from rust
that I agree
Dict, Array, Value, Str, etc. have to be in for it to even work
I think we can potentially have typst-eval depend only on typst-core, which is the bare minimum
bug then again typst-layout would require further base types like Length
the typst-core / typst-whatever split would be fairly arbitrary based on what is needed elsewhere and what isn't
which is frustrating because when you add something new, you might need to move a large amount of code from typst-whatever to typst-core
My understand is we don't split them early, which will cause many additional unnecessary code to tight crates together again. If we have some golden point to split, we would like to split them. For example, typst-syntax is nice as an individual crate and we've used them in typstfmt.
In short they large projects was spliting code early in some bad point, and find unideal thing. They large projects have to merged them together to reduce costs to write additional code. They large projects still don't have some golden point to split so keep a large core crate.
I think for the current speed of iteration, it is good that there is just one main crate. If the major planned features are implemented, we can then think about what a golden split would be.
I also hate the fact that I need to anticipate what subcrate we might need and reserve them on crates.io to be sure that we can use the name in a future-proof way.
Yeah but that's a whole other can of worms
There is this very nice RFC where multiple crates are part of one package
Ah, that's nice
it's like namespaces, but compatible with the current impl π
hot
@sturdy sequoia the PR I opened also gives a bit of a speedup compared to main on my machine. curious if you can reproduce that.
Masterproef evidence or it didn't happen @sturdy sequoia
I'll do that tomorrow π
The show/set one?
Yes
I guess it's tricky to benchmark since you would also have to make sure the output is the same
I haven't looked but I wouldn't expect it to be a lot different
While the PR does make some breaking changes, it mostly affects code that was non-sensical before.
@left night Actually, I have a case where my VM is more permissive and I want your opinion:
this is a test from the suite:
// Count labels.
#let label = <heya>
#let count = counter(label).display()
#let elem(it) = [#box(it) #label]
#elem[hey, there!] #count \
#elem[more here!] #count
I tried modifying it to:
// Count labels.
#let label = <heya>
#let count = counter(label).display()
#let elem(it) = {
box(it)
label
}
#elem[hey, there!] #count \
#elem[more here!] #count
On my VM, this works, on the webapp it's refused
Should it work or should it be refused? π€
It is trivial for me to do the old behaviour, so it's really a design decision, not a constraint in any way shape or form
(literally one line of code)
I kinda like it
Next point of contention, this code:
// Ref: true
// Test continue while destructuring.
// Should output "one = I \ two = II \ one = I".
#for num in (1, 2, 3, 1) {
let (word, roman) = if num == 1 {
("one", "I")
} else if num == 2 {
("two", "II")
} else {
continue
}
[#word = #roman \ ]
}
It works, except that my code still tries to destructure after calling continue since the next flow opcode is right after which makes this not work π€ I can probably special case it though
I can also make destructuring imply a flow to avoid issues
I remember a label attachws to its previous content automatically, but documentation also remind it doesn't work in "code mode".
I think it is a limitation, and your VM extends it.
Essentially my VM forces you into "content" mode (in terms of joining of values) once you try and join a label
what do you mean by it works? that it does not throw an error due to not being able to destructure none?
I admit it might be a bit weird
It works in the sense that the control flow works, but the destructure fails
ah, I think that should be changed then
let (a, b) = if x.len() == 2 {
x
} else {
continue
}
I would expect such a thing to just work
yeah I know, but it's because of the weird control flow scheme
it happens right after destructuring, not before
But as I said, I can just manually append flow opcodes where necessary
like before an assignment
It does seem reasonable, but I'm not 100% sure yet whether it will have unintended consequences and, in particular, whether it will be future compatible with content = value. Is it equivalent to having content + label join into labelled content?
The current let binding code also has a special case to handle this, so I think it should be the same here.
exactly, it basically joins label to the previous content
I did that too, I just generate a flow opcode before calling the destructure opcode
okay. the problem is that when content = value, every element is its own type and to roughly retain the current way things work, the joining rules need to be adapted. basically any two values need to join into a sequence. any ++ label (using ++ for join here) could probably still do labelling as a special case, but I need to think more about this.
Yes, the reason I currently changed the behaviour to the same as main is that it's non-trivial to label sequence elements that are in the joining "value"
do you mean that it would label the sequence rather than the last leaf?
or the other way around?
yes
if you do content ++ content ++ label, does that mean sequence[content, content] ++ label or sequence[content, content ++ label]?
I'd argue it's non-trivial
take the case #f() <this>, where f produces a sequence
yeah, that might be the reason this doesn't exist on main (or nobody ever thought about it, can't remember)
yes I think it's better if that doesn't work as it's less error prone imo
and more consistent
I think supporting syntactically labelling in code mode might be nice (i.e. figure(..) <label>), but arbitrary labelling should perhaps be done via a function. Maybe the same should actually apply in content mode. It's more predictable that way.
It also fixes the problem that people from time to time try = Heading #label and the label attaches to the text rather than the heading
yeah, that's sort of what I meant
this is the kind of thing that is either super easy to parse or a parsing nightmare
if we want to make the breaking change for content mode, we should wait until type rework probably because it'll be quite the breaking change
I'm hoping to batch a bunch of breaking changes up and then make one very breaking release
For the VM I am trying not to break things and so far I am achieving it which makes me happy π
I mean, almost everything is broken
but what works isn't
π
^^
Do you want to know the trickiest bug I have fixed so far?
I was debugging a test that would only fail if it was a test, not as a standalone document. Is so happened, that I was creating a local variable with the name of a closure (to allow recursion) but the closure was supposed to be anonymous, but it was not because I accidentally gave the compiler the module's name (counter) which means that when it did counter(heading) it was re-invoking the module
π
That was a very weird one to debug
oh no
BTW, once the VM actually works, will you and I be able to schedule a call to implement tracing (like seeing execution of nodes, etc.) because I have no idea how it all works and how to integrate it. I'm talking in 2-3 weeks most likely
sure, we can discuss it then! just for clarification: do you mean tracing as in hover tooltips?
yes exactly
@left night outside of doing some bit magic, do you think there is a way that a Span could become align(4) or (ideally) align(2)?
π
I could make it #[repr(packed)] but it's a bit unholy
I think you could store it as [u8; 8] and boom, it's align(1). Casting between that and u64 shouldn't be a problem as long as you do it by value rather than by reference (i.e via as_ne_bytes)
@sturdy sequoia have you compiled your thesis yet?
I can't wait either...π π
No, I have changed a few things Iβll try again today
Well, I had this theory that: less registers = faster VM
I was right 
So one optimization I'll be implementing: several sizes of VMs, and it will select one depending on the number of registers you need (from 4 -> 256)
(so 4, 8, 16, 32, 48, 64, 96, 128, 256)
it's actually surprisingly easy to do with const generics and an enum π
Phew. That rarely work, because const generics are limited.
But it's good that you can make do!
My idea is basically to have a Vm<const N: usize> and then an enum like:
enum VmSizes {
S8(Vm<8>),
S16(Vm<16>),
...
}
easy as it gets
wouldn't that just have the enum be the size of the largest vm?
or are the registers heap allocated?
hmmmm
I didn't think of that!
Probably would need to be a function instead
like
fn eval(...) {
if min_reg <= 8 {
Vm::<8>::eval(...)
} else if ... {
...
}
}
And hope the compiler is smart enough to figure it out
yeah that could work
As it turns out tablex needs more than 64 registers π
I think you need a fallback stack once too many variables are needed at one time
uh, actually not a stack
just a heap allocated (resizable) array
didn't laurmaedje have some funky pattern for this?
probably yeah
I mena there's a good change that just using the heap would work
and having infinite registers
I was just writing that
but yeah
or another idea
you could have ```rs
trait Storage {
fn read(&self, reg: Register) -> Value;
fn write(&self, reg: Register, value: Value);
}
impl<const N: usize> Storage for [u8; N] {}
impl Storage for Box<[u8]> {}
struct Vm<S: Storage> {
registers: S,
}
ah yes, we can go back to heap if need be
that's clever
and small VMs can benefit from the full set of optimizations by having small register arrays
yes but that's a one-time cost: when "entering" the VM
well, not quite
it's the download size of the wasm binary
it's the amount of icache that is taken up by that part of the vm
it's the initial memory accesses that might not be quite as efficient due to code being as close together
but I expect it not to matter too much in the grand scheme of things
true, but executing code will likely still be slower π
You're never going to get to the point of actually compiling actual documents are you? π
I mean I'm trying 
why does it depend a const generic size of registers?
Maybe too hard!
It will generate 10 copies of VM code if you uses Vm<4,8,16,32,48,64,96,128,256>
to optimize its size based on the actual needs
Just be careful so you don't overwork yourself!
Yes!
How do we use the generic constant? If it is for a static-sized register array, there will no difference with an Arc register array or even a Vec register array.
I also don't yet understand why you need to instantiate the Vm over and over again instead of just storing the registers in a Vec and only using a smaller amount of the Vec. I can't imagine that the one allocation will be the costly thing.
I don't see how this is different from 1 + break + 3
#{
[ one ]
[ two ]
label("hi")
}
is query(<hi>).first().elem [two] or [one two]
this is where a never type (and uh... value) might help
a never value, huh ^^
what do you mean?
Well that's the thing, I haven't tested yet, my idea with limiting the number of registers is to (hopefully) guarantee that they're in L1 cache
But I don't know whether this has materialized and I haven't profiled the compiler yet
All I know is that less registers translate to double the performance
By why can't a Vec of registers be in L1 cache?
it can, but since the array would have been on the stack the likelihood of it being in L1 cache should've been higher
The solution that was settled on seems reasonable to me, I've no more to add
Oh that's smart
is that really so? if the vector is constantly being accessed and still not in cache, I would blame the CPU for being dumb.
fair
Anyway, I'll give it a try too, I'd just like the VM to work now π
Dherse's idea is that the registers would share the cache together with the Rust variables
If I understand
There is a weird behaviour that makes tablex not compile and I don't know why
I am just very concerned about code complexity, compile times, and generated code size with heavy use of const generics across the entire eval pipeline.
I mean yes, but the registers are rust variables in a way π€
right now there are no const generics
so not much to worry about
yet
yeah, that's why I'm arguing heaveily against it now
well but if one's on the heap and one's in the stack...
much of the values being operated on will also be in various places of the heap though
I still think that having const generics using @cunning wadi's Storage trait is good for small Vms (i.e that need like < 16 registers) because those are likely very short code, very short lived so not allocating in those ones is probably a good thing
how big is a register?
16 bytes iirc
Value is 32 bytes
really?
damn that's chonky
so registers are 32 bytes π
chonk
but even with a fully reworked value representation, I don't see how to go much lower
to store an integer inline, we need 8 bytes. we need to store type information somewhere (most likely a pointer), that's another 8 bytes. and then we need some pointer to extra optional stuff like labels and locations. so that's 24 bytes min. if we want 16 bytes of inline capacity (which I think we want), we're back to 32 bytes.
16 registers is half a meg
megabyte? half a kilobyte.
we could store just one pointer maybe
the type info pointer could indeed be moved into the extra optional stuff allocation. not sure whether that's worth it as it would add extra branching and potential indirection to all type checks. would need benchmarks.
but in the inline case, we still need at least one pointer in addition to the inline value
and 8 bytes of inline storage would not let a short string be inline, defeating the entire purpose of EcoString (which is 16 bytes)
Technically we could optimise way further by exploiting pointer repr
acheieving more than 24 bytes of inlined string
@left night @glad urchin @sly pecan @tight glade @feral imp I have made TREMENDOUS progress: IT COMPILES TABLEX
π π π π
I ran the whole tablex-test.typ and it all seems correct to me

is it fast?
IT COMPILES MY THESIS
OMG
OMG
OMG
I DID IT
IT WORKS
π
π π π π π π π π π π π π π π π π π π π π π π π π
"no"
not yet, I haven't pushed
oh
ok goddammit do I cancel now?..
yes?
I'll try and figure out why it's so goddamn slow
is compilation slow or the interpreter?
it's the VM
but I know why
(I think at least :D)
Small functions are currently quite slow
Hourra! Sad its so slow tho π

that's actually not a bad idea? π€
Ok, so I did the change
and it's about 20x faster in debug
time to see in release π
It also simplified a ton of stuff which I like
I read this and thought you meant it was 20x faster in debug than release
no π
Wondering what kind of cursed stuff you were doing
I will not, in fact, answer that
Wise.
What you say can and will be used against you.
I plead the fifth?
Time of dynamic number of registers, maybe that'll work π
I will neither confirm nor deny that I murdered that man
WHY IS IT SO SLOOOOOOOOOOW π¦
unsafe { std::transmute((******arc).wrapping_offset(engine.among_us)) }
I don't get that, but I'm sure it's super funny
I don't think you're supposed to get it
It's fast in debug and slow in release?
have you profiled it? what does it say?
I haven't yet
it's faster in debug than it used to be, but in release the difference is much lower
Alright. Progress is progress. There is a lot of experimental / novel shit here, so do take time to just reflect and appreciate that it is growing organically.
I'm sure it's just some dumb stuff that's slowing down everything
I am profiling atm
I did perf. I am no expert in hashing, but goddam, that's a lot of hashing.
I can't find anything not about hashing, actually
ooooooooooooooof
ooooooooooooooooooooooooooffffffffffffffffffffffffffffffffffffffffffffff
I did a poopsy it seems
aha nevermind
what went wrong?
too much hashing
I just need to figure out where the rest is 
Wait, I think I know!!!!
I did not, in fact, know
I was going to say "... and you will first profile it to check if you are right or not?.." but decided not to...
I mean, there is too much hashing
I'm trying to figure out why
May be you should switch from SipHash to another hasher?
For example hashbrown uses AHash as the default hasher, which is much faster than SipHash.
We rely on the quality of SipHash to make the assumption that there are no collisions
You do not need HashDoS resistance inside compiler in any case
we kind of do though...
I think hash collision can be easily abused to extract data from unrelated sources
so, say, a package would be able to access the cached read file
that's bad
plus even if all hashing suddently become something like xor hashing, it would have not helped here
it's an algorithmic issue caused by inaccurate use of comemo (I assume)
actually, I bet you can do even worse, can't you?..
You do not need to worry about collisions. In any case AHash provides the same level of collisions as SipHash.
It just doesn't provide HashDoS resistance
like what if you replaythe changes to a different object?
can you break some of the safety guards somehow?
We do need to care about hash collisions. comemo relies of high quality hashing
High quality hashing and HashDoS resistance is not the same thing.
Actually SipHash just recreate hasher every time.
I know it's not the same thing, but I don't understand what you are saying
I mean that if you compare AHash and SipHash regarding to the collision possibility, so they are pretty the same.
ah, so you are just proposing replacing siphash with ahash?
ahash is 64 bit though
?
look, isn't it just letting Dherse work at it a bit more, to figure out why there are lots of seemingly unnecessary hashings, and somewhere, sometime later it can discussed to change the hashings?
yeah ofc
switching hashes now won't give anything
current VM performance is a bug
Yes. SipHash recreate hasher every time when you change the size of hashmap. When the hashmap grows from 3 to 7 and so on.
AHash at the same time created one time, when you create hashmap, and it stays the same during the life of hashmap
I actually don't look at the code, so may be it can be improved. But AHash really fast. It can be 10 times faster than SipHash, especially when you use simple keys like integers, etc
Is there 128 bit version of AHash? If not, it's basically useless.
64 bits are not enough to prevent collisions
We'd need to talk about this with @left night
Hey, it's now faster than main π
Of courseπ. But you need to remember that even if the output hash doesn't have any collision, they can occur when you calculated the remainder (u64 / buckets). And because current rust hashmap implementation use sse2/neon instructions, you probably dos not notice difference on the small size tables (up to 48 elements, I think).
And AHash passes the full SMHasher test suite (https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md#Quality), so it's output is pretty good
Hashes are not used for hash table directly, they are used for comparison
in other words, there is a chack that hashes are equal, object equality is not checked
moreover, the "rehashing creates a new hasher" argument is notr applicable here afaik, because the hasher for a comemo hash table is separate from a hasher for tracked objects
Hm.. Actually you are wrong.
Hashbrown stores the only first 7 bits of the hash, and all other comparisons are done on real keys, not their hash values
I think there is a misunderstanding going on. I am not talking about hash tables in general.
Here is how comemo works: it has a big hash table for every memoized function call, from the 128bit hash of its arguemnts. Every time the function is called, it hashes it's arguemnts (not quite, but let's pretend it does...) and retrieves the memoized result. The values themselves are not compared.
Oh. May bad π . I think that we talk about the hashmap implementation
yeah, it's ok, everyone makes mistakes
that's why it needs 128bit hashes, 64 bit is just not enough, so a lot of hashers are basically out of question
@sturdy sequoia once you resolve the hashing problem, can you try something? It's probably a bad idea (big 50/50... well... 5/90/5, where 90 is "does nothing"), but at least it's not too hard to check and if I am right I want to be remembered as the best software engineer to ever exist.
In vm/mod.rs, in enter_scope, try boxing the new VM before running vm.run(engine). I have a theory (not confirmed by anything, just an educated guess) that VM is too fat for the stack and it causes worse stack performance because hardware hardware blah blah blah frontend prefetchers L1 stack engine blah blah blah. Of course, preferably the same VM (and same registers) should be used everywhere, but I assume it's not an option.
That's what I was able to infer with my eyes, at least. enter_scope looks the sussiest.
Look into the latest commit π
I mean, VMState is still on the stack, but the registers are on the heap
I'll also try removing the stacker calls
as I have the suspicion that they're really bad π
I have looked at stacker source code for a bit, it doesn't look that bad
it just maps more pages
did it improve anything?
yes, it immensely improved perfs
I also ditched the whole VM thing as it required some unsafe
so the VM is fully safe afaik
omg I am a genious lol
yeah that's basically perfect
and registers are infinite now
kinda, but there is agressive reuse of registers
I don't know if you saw but in the compiler there is a RAII register called RegisterGuard that frees registers for re-use as soon as possible π
Performance of eval has improved by 20% with the commit I'm about to push
eval = the VM in this scenario
my problem now is that function calls are... slooooooooooooooow
Much better than an hour ago
but too lsow
BTW, I just pushed
just function calls, or scopes in general?
calls now
scopes are fine since they no longer involve growing the stack, etc.
I KNOW HOW TO MAKE THE REGISTER TABLE SORT OF DYNAMICALLY SIZED
USE A SmallVec<[Value; 16]>
Makes small function calls cheaper by not allocating!
I AM A GENIUS
(time to test it
)
if only there was a way to not reinitialize registers every funcition call...
hear me out...
ABI
(/s)
I like
I thought about that but expected that you would consider the access overhead too high.
Yeah, there are literally zero gains
I have reached the point where eval is about 5x faster than main
But I was honestly expecting a 10x π¦
I guess eval is not as slow as I thought it was 
And the impact on my thesis (in spite of tablex) is surprisingly low
I assume that's because most of the time is spent calling short functions that aren't much more than glue code
and well, function calls aren't that fast it seems?
inline them
indeed
I am seriously considering it
it's almost trivial with the current design π€
I just don't know how I would deal with arguments (checking them)
but I could only do it when arguments are "simple"
(no spread, etc.)
Ok, I am now working on improving loop performance by removing allocations and dynamic dispatch
Fun fact, using #[repr(packed)] on opcodes does improve performance by a very measurable margin!
what's the current status on how much faster your thesis compiles? π
1 millisecond
I think what would be very interesting just to get a feeling for what is and isn't possible here is to write a reasonably complex evaluation-heavy script in Typst and Python and compare speed. I'd be curious which orders of magnitude of difference we're talking about here.
3% irrc
I'm honestly quite dissapointed 'cause it means that (simple rule of three) eval was 3% of the total execution time
π
Are you still working at it? Or do you think this current version does indeed reflect what's possible with a VM?
There's definitely room for improvement
I think closure calls are wayyyyyyyy too expensive for what they do
I can do this tomorrow
I removed the eval module
and it gives an idea of just how MASSIVE this PR is
πππ
It's a lot. But there is no way that improvement from now on won't entail massive efforts like this....
To be fair, there's also the fact that my thesis is just... not that heavy in the evaluation
I thought it was bigger than that
Could be more in other documents?
Cetz heavy docs must use eval...
most likely
anything with CetZ as @feral imp said
@sturdy sequoia A quite important key for heavy computing. Would you like to do scalar specialization? I mean you stores floats in 32 bytes registers but they could only be 8 bytes, and importantly you may do Value::Float::add/mul/sub/div per instruction, which is very expensive.
I don't have type information π
you can do it at runtime
since typst doesn't have runtime, all of them are at compile time.
hmm
I wonder if that would really help
I just point out why a such vm doesn't quite help the heavy cetz docs.
maybe you're right, I just don't know how I'd do it π€
I had to manually rebase 47 times in a row for some reason

The VM was broken by that
π
did you not keep a backup branch?
π
Git stores a rev pre rebase for you.
Search git reflog. All actions are logged there, and as long as you do not git gc, git keeps "backup" references for you.
π€ I think this is a JIT way to get native code performance in VM, which is very hard. But there would be a possibly way that we embed wasmtime and leverage wasmtime to do JIT by emitting wasm code to wasmtime.
π± @sturdy sequoia Forget the special cases. VM in parallel with rayon should benefit to both local typst and web typst. That may also beat main typst.
@proven umbra it's funny but my VM is finding bugs in CetZ
With some variables being used that don't exist π
I believe align can quite benefit from jit.
align?
Anyway, I'm off to bed
gn everybody β€οΈ
In the picture
ah right
gn
makes sense
oh-oh @sturdy sequoia
hopefully it's reproducible
oh wait, it's just stack overflow, false alarm
just need a recursion limit and it's all should be ok
yeah it does
@sturdy sequoia ok a real bug this time, patterns without let don't compile
#let (x, y) = (1, 2)
#(y, x) = (x, y)
probably covered by tests but still want to make sure
Oh no... typst is soo slooooowwww..... ππππππ
I have disabled memoization for closure::eval to make sure it won't just get OOM'ed..
and it crunches rays for like an hour and 20 minutes now
and python just does the whole thing in less than a minute
ππ
Lemme check main
oh it finished
main is fast?
vm with no closure memoization is 1:22:xx
main will OOM
python is like... a minute... haven't measured yet, it's pretty fast
I will check for a smaller resolution now
oh interesting
VM is slower, apparently
64.14 secs - VM, 34.24- main, and coughs 2.61 - python
Python is a low bar too
compiling VM again with no my hacks to make sure it's not me
well... tbf, I would be impressed if typst got to like 1/3 of python
because duh, everyone says python is slow, but it doesn't mean it's not optimized
it does have a good untyped VM
actually, my raytracer may rely on comemo in one place, so I assume it may not count...
but also...
python does not have comemo at all, so it's still bad
damn, my code is so shit I can't even fix it properly...
the render itself is pretty dope though
ok, VM 39.35
yeah, memoization helps in one place...
God I love that you're here..
really? π±
W-A-T????
π±
where did you add memoization?
the evaluation of the closure
I previously removed it
it would not have finished otherwise
Guess I'll do just that
missa sad
@atomic violet are there lots of small closures?
Because I have the suspicion that freaking closures are too goddamn slow
so, generally, I tried to minimize them
and I did pretend to be a guy who tries to optimize a lot in my code
(without measuring anything)
that's how we do π
and there are a bunch of small functions, but I try to not call them in hot spots
there is a lot of spreading
and destructuring
that's basically the main way of moving data around there
and as a consequence a lot of functions have a lot of parameters
I have the suspicion that I implemented spreading π©-ily while trying to make it faster π
I'm unsure whether destructuring is any fast atm
The hottest functions should be m33mul, intersect, cast-ray and cast-ray-through-transparent, and a few closures for rendering heart, typst guy, and tiled floor. Maybe it would be useful to run a sampling profiler on typst and on python to compare where the slowness is likely occur
I'll run it on my end and see what's going on
@left night I'm having a pretty big issue, for some reason, clalling Route::contains causes a stack overflow π
Do you know what could be causing this?
Here is an example from a stack trace
is goes on and on and on and on
I think I figured it out!
The check for cyclic evaluation was inside of a memoized function instead of the top level and that was the issue, maybe? π€
Yes indeed, that was it now it just panics π
I had forgotten to check for cyclic import which is now fixed
A check for cyclic evaluation inside of a memoized function should be fine. That's the whole point of the Route being Tracked. The check is recursive itself.
yes but here it was actually recursively importing
but it wasn't actually detected
because it was only being done in the compilation stage
ah okay
it was a bit weird, but there are still a few things I haven't ported over
Regarding performance improvements, I wonder how much of the time is actually spent on walking the AST etc (where a instruction-based VM should be faster) and how much is spent on "runtime" things like ops::join, inside functions operating on Values etc.
that's probably the issue indeed
this is where having types would be useful because I could pre-allocate a lot
We should probably figure out a way to profile that more precisely
yes, but adding #[time] everywhere would be ungodly π
I fear that time would have too much overhead, too
@atomic violet on my end, your raytracer takes 18s on the VM, I'll try on main now
yes, it's handly to get a rough idea, but not good for fine grained stuff
Performance counters!
well, it's not that simple tbf π€
Maybe crunch it though callgrind? Count instructions instead of actual time?
You could have, just for some quick tests, a cheaper version of #[time], that measures just one thing at once and its global effect: Effectively, a global mutable static that is increased by the time spent in a single leaf operation every time in runs. And then compare that with the full time spent. This might work better than a flamegraph at showing the full effect of many small leaf calls.
I am not sure how much overhead Instant::now() incurs, but it would be easy to check whether this kind of instrumentation increases the runtime significantly.
Instant::now is super slow, but if we accept having a linux only version in the CLI (gated with a platform flag) we could do that super cheaply
linux and macos *
I agree, it has a specific name too in the nomenclature but I forgot what it is
and time could just generate a per-function counter
I was more thinking about annotating a function of interest with a special attribute just temporarily
Rather than actually committing that to the code base
ah right
So it doesn't matter whether it's linux/mac only
At least for starters
looks good
poor Typst guy's eyes
I'll maybe try that
ikr
with main it runs in 9.7s versus 19s on the VM
π
you are running this right?
I made a bug in that one
What hints to me that the VM is doing too much work is the size of the timings files
with main they're about half as big
but the VM shouldn't be doing any more timed operations
π€
which makes me think it's somehow running the code... twice?
so was this the final verdict? Python is 13x faster?
Let's see on that version
because, if yes, I would be pretty happy with that
2.14 python, 29.84 VM, 24.75 - main 0.10
I am a dum dum
I was calling 0.10 "main" this whole time π€¦ββοΈ
so yeah, about that
How about main main?
one sec lemme build it
It may be joining?..
hmmm
because that's the only place where content is joined iirc
that's makes sense
main main is 16.74 π
oh wait
usr time: main - 14.83 vs 0.10 - 16.87
wth was it doing for 7 seconds?...
so Python is just 8x faster? π
seems like it
if that's true then typst is doing pretty good.
I love how the comparison is to latex, but this is a true behemoth to measure up to.
I would need a proper measuring procedure so name a standard deviation ofc, but the order seems to be about ~6-9x
I am NOT rewriting that shit to latex
that's honestly much faster than I would have expected
What does 0.10 spends 7 seconds of system time doing? Writing a pdf?
and its so consistent as well: 7.09 - 7.5 seconds
@atomic violet so I think you're probably right, it's probably joining that's so goddamn slow
Ok I do think that looping was abnormally slow
I think I just fixed it
time to see
it improved by like... 100ms
π
is the python code doing the exact same thing though?
the result is identical
it should do the exact same thing
I'm not talking about the result
unless github copilot translated it wrong and I missed it
Ok so it's not joining that is slow
Than I have no fucking idea what is slow
π
and considering that even unused variables got translated, it should be the same
it must be function calls?
Take a break my dude. You can come back to it later
Just to make sure you are running in release mode right? π

OF COURSE
So, I have narrowed it down to function calls
Now what I don't know is if it's native function calls
(i.e in the global scope)
or VM-ed function calls
which I assume are at fault

wait no
function calls are fast
Ok so somehow reading values from the VM causes hashing
I have no idea why
So, as it turns out, it appears that accessing global values is the slow bit
is it easy to fix?
I don't know yet
For some reason it hashes like crazy
but even my attempts at improving it don't help

ok so accesses are indeed the slow part
for some reason

let me callgrind it
btw, I don't know how comemo works, but if there is a function like
#let f(big-arr) = {
big-arr.at(0) = big-arr.at(1)
big-arr
}
how is it memoized exactly? What are the constraints here?
I mean did identify the slow part of the code
I just... don't know why it's slow
it's access.rs in the typst::vm
I did at a typst_macros::time to it and it's indeed extremely slow
but I really don't know why
I've tried to bring in some optimizations to no avail
I think I'll need to rework the entire Access system because it's just bad
but I don't know how other VMs handle mutable field access
if anybody knows I'm all ears
just naive caching with copy-on-write. so it will do full clones here because there are multiple references.
it would be great if it could somehow optimize that of course, but it doesn't atm
yeah it will be hard to optimize
I think it's either comemo just doesn't memoize this, or it has some sort of "reversible constraints" or something...
to optimize you need to profile, to profile you need debug info, to have debug info you need to not optimize
literally the worst part of software engineering
While I welcome memes, it seems like a great indicator, that you guys should take a break. If you continue, I'd highly advise outlining low hanging fruit (even if there isn't one!) and going for that. Memes are a path way to burnout π
But I'm just a goose anyways...
dherse should, I am have not done anything yet lol, I am just trying to find suspicios hashes but the <cycle 3> ruins my attempts π
it's done by accessing fields
you can find that in access.rs somewhere in there there 's a get_field or something called
that's what's doing it
afaik
@sturdy sequoia What is foundations::Func::method?
it looks like it prehashes hella lot
and it is invoked in Access::read
π€
