#How can I optimize my code further?

404 messages · Page 1 of 1 (latest)

zenith pike
#

u128 is surely excessive here, even u64 is too big to iterate over every value, and u128 operations are definitely more expensive than u64 as those are software-emulated

tidal fractal
#

what is the code meant to do

#

like, what's your intent

#
fn main() {
    let mut s = String::new();

    for mut i in 0_u128.. {
        s.clear();
        while i > 0 {
            s.push(char::from((i % 94) as u8 + 32));
            i /= 94;
        }
        println!("{}", s);
    }
}

how's this

#

yes, which made it slower

#

... are you not benchmarking this?

pure root
#

Why are you using tokio here @misty vector ?

#

Look into rayon

#

Tokio is giving you nothing here

tidal fractal
#

rayon probably wont' help much either because stdout has a mutex on it

#

and i highly doubt anything but the printing is the expensive part

pure root
#

Yeah very true, for the code you've sent it'd be very fast just as a normal iterator

#

Tokio runs async code across multiple threads, but your code is not async at all

#

Is this just a thought exercise, or is your code noticeably slow and you'd like to improve the speed?

#

If your number of iterations is high enough. You should benchmark this kind of thing for sure

#

Building the string is going to be super fast compared to the println

#

For lower iterations just a single threaded iteration would be faster

#

How are you measuring it to be "insanely slow"?

#

It takes a second for each println?

#

Your biggest bottleneck is just the println

#

But also make sure you're running it in release mode

#

Just printing to stdout is much slower than any of the computations your code is doing

#

I'm not sure what help i can give, since i don't really know your ultimate goal

#

You're pretty much benchmarking printing to std out using your eyes as a measurement

#

Let start = Instant::now()
...
let ns = start.elapsed().as_nanos();
Print ns

candid ether
#

The workaround is simple: Don't print

#

That's all there is to it

#

The alternatives are:

  • Print less, e.g. only every 1000th iteration
  • Get something that can dump to files really fast
#

Also to come back to this:
If your loop could run at 10GHz (completely unrealistic), a u64 would overflow after 58 years of runtime. You don't plan to run this for 100s of years, do you? No? Then maybe use an u64, it's enough

plush wave
#

edit: woops, I misread something, sorry for the ping

rapid wadi
#

i ran a flamegraph on that code (literally just cargo flamegraph > /dev/null)

#

i use flamegraphs for like 80% of my optimization work, generally, they're great

#

i factored stdout locking outside the loop, which surprisingly hardly improved performance; imma try replacing the format machinery next

#

circumventing format machinery also hardly improves performance (it's visible though)

#

flamegraph looks like this now

#

(i changed the loop signature to for mut i in 0u128..1000000 { btw to have a finite runtime to compare with)

#

replacing String with Vec<u8> doesnt change performance whatsoever, surprinslgy

#

imma try replacing the Vec with an array

#

that brought down runtime by another ~3%

#

hmm what now

#

oh of course

#

buffering

#

how did i only think of that now

#

LMAO

#

0.7s to 0.05s

#

buffering is overpowered

#

i 100x'ed the loop limit and it takes just 0.99s

#

ill play around with the buffer size a bit (i started with my go-to size of 50KB)

#

hm doesnt change very much, ill go with 1MB, it hovers around 0.97s

#

kinda sad i only tried the buffering thing now; the buffering issue presumably overshadowed the real improvements by all the ideas i had before

#

anyways heres the flamegraph now

#

around half of runtime is copying bytes into libc; that's kind of a dead end (if you don't venture into obscure kernel features)

#

the other half is everything that's not function calls, so, presumably the straight math done in the main function

#

we can try to calculate two digits per loop cycle

#

by dividing through 94² instead of 94

#

before i do that ill check the assembly of the core loop

#

?godbolt ```rust
pub fn foo(buf: &mut [u8], mut n: u64) {
let mut i = buf.len() - 1;
while n > 0 {
i -= 1;
if i >= buf.len() {
unsafe { std::hint::unreachable_unchecked() }
}
buf[i] = (n % 94) as u8 + 32;
n /= 94;
}
}

wet mirageBOT
#
example::foo:
        test    rdx, rdx
        je      .LBB0_3
        mov     rcx, rdx
        add     rsi, -2
        movabs  r8, 6279742663390485657
.LBB0_2:
        mov     rax, rcx
        shr     rax
        mul     r8
        shr     rdx, 4
        imul    eax, edx, 94
        mov     r9d, ecx
        sub     r9d, eax
        add     r9b, 32
        mov     byte ptr [rdi + rsi], r9b
        dec     rsi
        cmp     rcx, 94
        mov     rcx, rdx
        jae     .LBB0_2
.LBB0_3:
        ret
```Note: only public functions (`pub fn`) are shown
rapid wadi
#

(i changed it to u64 because with u128 it just called into a stdlib function which doesn't optimize nicely; and i added the unreachable_unchecked to add the context which the full program probably provides, and to remove clutter from the asm)

#

ok i cant tell much from the asm

#

nice, unrolling the loop to two digits per cycle actually helped rust let mut i = buf.len() - 1; while n > 0 { i -= 2; buf[i + 1] = (n % 94) as u8 + 32; n /= 94; buf[i] = (n % 94) as u8 + 32; n /= 94; } if buf[i] == 32 { i += 1; }

#

~0.88s -> ~0.84s

#

scaling up to 3 digits per cycle made it much worse (>0.9s)

#

changing u64 down to u32 chops off 15% of runtime but i cant get away with that because it would overflow in like a few dozen seconds

#

hmm one idea i still have is a lookup table

#

mapping n % 94² to the two ascii digits it produces

#

that would be slightly messy to implement

#
#![allow(unused_must_use)]
use std::io::Write as _;

fn main() {
    let stdout = std::io::stdout();
    let stdout = stdout.lock();
    let mut stdout = std::io::BufWriter::with_capacity(1000000, stdout);

    let mut buf = [0; 21]; // 20 characters + newline to fit all base-94 u128's
    buf[buf.len() - 1] = 10; // newline
    for mut n in 0u64..100000000 {
        let mut i = buf.len() - 1;
        while n > 0 {
            i -= 2;
            buf[i + 1] = (n % 94) as u8 + 32;
            n /= 94;

            buf[i] = (n % 94) as u8 + 32;
            n /= 94;
        }
        if buf[i] == 32 {
            i += 1;
        }
        stdout.write_all(&buf[i..]);
    }
}
#

i changed it because u128 were strictly useless for this specific program

#

sure

rapid wadi
#

but for benchmarking, you kinda need a specific piece of code to compare

#

and this specific piece of code just never reached the number ranges that u128 supports

#

u64 holds up to 2^64-1, u128 holds up to 2^128-1

#

?play ```rust
dbg!(u64::MAX);
dbg!(u128::MAX);

wet mirageBOT
#
[src/main.rs:2] u64::MAX = 18446744073709551615
[src/main.rs:3] u128::MAX = 340282366920938463463374607431768211455```
rapid wadi
#

length of what characters?

#

ah ok

#

when i replace the loop header with for mut n in [u64::MAX as u128, u128::MAX] {, it prints ```rust
@22>%,ipPg
+"G}PV:g:h%6edu))p+-

#

so that's 10 and 20

#

sure we can do u128

#

what im saying with u64 is u128 is, currently the code just loops upwards from 0

#

and when doing literally that, you'll never reach generated strings above 10 chars

#

so doing hardcore optimization and benchmarking on u128 is kinda pointless

#

we'd need a more realistic benchmark

#

for example one that samples n uniformly-ish from the whole u128 range, not just counts upwards

#

the latter

#

also, i unrolled the inner loop, not the outer loop

#

i made it convert two digits at a time, not to numbers at a time

#

where

#

ok i can explain both

#

this is part of a more efficient printing system

#

wait actually, it was just an oversight by me

#

i assumed we're converting to base94 big endian (i.e. the most significant base94 digit coming first)

#

hm okay

#

i would guess it's pretty similarly fast

#

well anyways the reason i did - instead of + because i start at the very right, at the least significant, and then write the digits towards the left, going ever more significant

#

because i thought we should be most significant first

#

but you're right, it should have been +, i.e. going more significant digits to the right, because that's what the original code did

#

thats part of the unrolling

#

for example if we wanna convert 123456 to base94, the code would convert the first two digits to 34 and 91, and then the next two to 13 and 0

#

but now you see we overshot by one digit

#

we should've stopped before the 0 because we were already done at that point, but we couldn't stop because we did 2 digits at a time

zenith pike
# rapid wadi ~0.88s -> ~0.84s

lol you must be on linux
I'm on Windows and, not sure how close it is but after several minutes it's still at 4 letters

#

I took a stab at writing it faster too but I can't get it to finish because of printing

rapid wadi
#

so with ```rust
if buf[i] == 32 {
i += 1;
}

rapid wadi
#

i benchmarked everything while routing to /dev/null

zenith pike
#

Aha, makes sense

zenith pike
rapid wadi
#

32 is a space iirc

#

but yeah

#

probably shouldve

rapid wadi
#

what is crunch?

#

what do you mean "know it's 4 digits"?

zenith pike
#

If 0123456789abcdef are the characters then your program is not behaving the same way?

#

I'm getting stuff like

!:E?!
!:E?"
!:E?#
!:E?$
!:E?%
!:E?&
!:E?'
!:E?(
!:E?)
!:E?*
!:E?+
!:E?,
!:E?-
!:E?.
!:E?/```
#

could explain the speed difference, this is doing a lot more work

rapid wadi
#

you mean know when the program should stop?

#

that's dictated by the loop condition at the top

#

in my current code, the loop just stops at 100000000

#

~0.85s

#

there is no mutex right now

#

i dont know

#

it may

#

it may not

#

i got another optimization idea, from the newly added context

#

of those i only expect codegen-units to be able to make a difference in this case

#

ill try

#

mayybe

#

although, not really actually

#

we are only using stdlib

#

and i dont think lto works across stdlib if you use the default precompiled stldib

#

i see no clear difference

#

not really

#

the only significant stdlib call here is write_all

#

and its actually marked #[inline]

#

so it can be inlined even if lto is off, if i remember correctly

#

as far as I remember, setting lto = true enables inlining functions that aren't marked inline

#

but all relevant functions in this code are already marked inline (write_all, and slice.len() i suppose)

rapid wadi
#

either naively (i.e. loop through, least significant to most significant, keeping track of carry), which could also already be faster tbh

#

or smartly, by taking advantage of native integer addition and carry (something that the 31GB/s FizzBuzz impl also did iirc)

#

not really, it's probably a decent chunk of work to get all working

#

we probably could.. :p

#

before attempting, it would be a wise idea to formalize the requirements though

#

for example if it's ok to start with leading zeroes, and/or if it's ok to hardcode to a certain character length, the programming would get much easier and probably faster

#

ok so the output must be identical to your original posted code EXCEPT the base94 numbers can also be big endian instead of little endian? @misty vector

rapid wadi
#

no updates

#

i havent started

#

no promises i will

#

im just here for fun ^^

#

why there?

#

interesting

#

i mean if you want i can do some coding and explain

#

idk how far ill get

#

although that would be easier in a voice channel imo

#

thats ok

#

i was too, for the longest time

#

still am a bit but much less

#

@balmy bison

#

ok

rapid wadi
#

(we continued this conversation in voice channel, and then DMs)

#

current state, if anyone is interested

#

without the stripping, the output is 1.6GB/s

#

at this rate, it will take roughly 370 days until the u64 accumulator is overflowed

#

thats enough for my standards

rapid wadi
#

how are you measuring, and how are you invoking crunch?

#

prooobably not, because in the end the data still needs to be outputted sequentially

#

at least if we're using stdout?

#

see the command visible in the screenshot

#

when i do that, it has quite some delay

#

what are you doing about that

#

oh

#

and youre usin literally 1 4 as arguments?

#

bc that doesn't do the same thing as our rust code

#

crunch seems to default to a-z as the charset

#

and 1 4 prints only 475254 lines

#

crunch 1 4 prints 2357264 bytes, taking 3s + 0.085s for me

#

that's 28 MB/s

#

lol

#

yeah

#

yeah

#

it would probably be a tiny bit slower with big endian

#

the space stripping can probably be faster

#

and we can probably calculate more per loop cycle

#

for example we can have an array of 8 accumulators representing the first 8 strings. In each loop cycle we'd batch increment them all by 8 and and roll around them all, to get the next 8 strings

#

the compiler will probably help us with autovectorization

rapid wadi
#

sorry just arrived home

#

yeah

#

it keeps track of how long the generated strings get

#

checkpoint is the index where the string length is gonna get one bigger

#

that is the question

#

find some way to not have to compare i to checkpoint every itreation

#

or make the check less expensive

#

the length doesnt increase every 93 iterations

#

not even every 94

#

it grows exponentiallyy

#

think about base10

#

there's 9 single digit numbers (1-9)

#

then 90 two digit numbers (10-99)

#

then 900 three digit numbers (100-999)

#

and so on

#

i dont know

#

im not the computer

#

ask the computer :p

#

why?

#

relative timings should be consistent

#

you can measure the old one and then the new one

#

itll be interesting to see if your baseline is faster or slower than mine ^^

#

cpu compare time

#

your cpu has a higher passmark score than mine

#

although my single thread rating is slightly higher

#

which is what counts here

#

so lets see

#

share your troubles

#

you always need to leave so suddenly

#

ok i made the batch thing

#

sweet spot seems to be 8 x u64

#

but even with that, i get only ~1.5% improvement (using target-cpu=native)

#

so thats kinda sad

#

?godbolt flags="-C opt-level=3 -C target-cpu=native" ```rust
pub fn increment(acc: &mut [u64; 8]) {
for acc in acc {
*acc += 8;
*acc += ((!*acc & 0x8080808080808080) >> 7) * 0xA2;
}
}

wet mirageBOT
#
.LCPI0_0:
        .quad   -9
.LCPI0_1:
        .quad   72340172838076673
.LCPI0_2:
        .quad   162
.LCPI0_3:
        .quad   8
example::increment:
        vpbroadcastq    ymm0, qword ptr [rip + .LCPI0_0]
        vmovdqu ymm1, ymmword ptr [rdi]
        vpbroadcastq    ymm4, qword ptr [rip + .LCPI0_1]
        vpbroadcastq    ymm5, qword ptr [rip + .LCPI0_2]
        vmovdqu ymm2, ymmword ptr [rdi + 32]
        vpbroadcastq    ymm7, qword ptr [rip + .LCPI0_3]
        vpsubq  ymm3, ymm0, ymm1
        vpsubq  ymm0, ymm0, ymm2
        vpaddq  ymm1, ymm1, ymm7
        vpaddq  ymm2, ymm2, ymm7
        vpsrlq  ymm3, ymm3, 7
        vpsrlq  ymm0, ymm0, 7
        vpand   ymm3, ymm3, ymm4
        vpand   ymm0, ymm0, ymm4
        vpmuludq        ymm6, ymm3, ymm5
        vpsrlq  ymm3, ymm3, 32
        vpmuludq        ymm3, ymm3, ymm5
        vpsllq  ymm3, ymm3, 32
        vpor    ymm3, ymm6, ymm3
        vpaddq  ymm1, ymm1, ymm3
        vpmuludq        ymm3, ymm0, ymm5
        vpsrlq  ymm0, ymm0, 32
        vpmuludq        ymm0, ymm0, ymm5
        vmovdqu ymmword ptr [rdi], ymm1
        vpsllq  ymm0, ymm0, 32
        vpor    ymm0, ymm3, ymm0
        vpaddq  ymm0, ymm2, ymm0
        vmovdqu ymmword ptr [rdi + 32], ymm0
        vzeroupper
        ret
```Note: only public functions (`pub fn`) are shown
rapid wadi
#

considering how vectorized the assembly looks

rapid wadi
#

This is where i got the idea of that carry trick from btw

twilit thunder
#

500+ chars?

#

idk i’m probably getting something wrong

#

but you do not need to use 128u for 500+ chars

#

i’m probably just being stupid

rapid wadi
#

500+ chars in the base94 encoding of the number

#

although u128 would not be enough for that either lol

#

that's more like u4096

twilit thunder
rapid wadi
#

the encoded version sure

#

but the source number

#

also needs to be stored

#

anyhow

#

since the task is to count upwards, even 20+ chars in the base94 encoding will never be reached

#

@misty vector btw i have another idea for quicker IO

#

Linux vmsplice syscall

#

its a rabbit hole i already tried to venture into, once, but didnt really manage

rapid wadi
#

Windows doesn't have an equivalent afaik

#

Bad luck for windows

#

The compiler generates simd instructions automatically

#

That's called auto vectorization

#

The code needs to be suitable for it

#

Like in the godbolt code i posted above

wet mirageBOT
#
warning: function `main` is never used
 --> <source>:3:4
  |
3 | fn main() {
  |    ^^^^
  |
  = note: `#[warn(dead_code)]` on by default

warning: 1 warning emitted
rapid wadi
#
fn main() {
    let mut stdout = BufWriter::with_capacity(1000000, stdout().lock());

    const BATCH_SIZE: usize = 8;
    let mut acc: [u64; BATCH_SIZE] = std::array::from_fn(|i| 0xA2A2A2A2A2A2A2A2_u64 + i as u64);
    let mut i = 0;
    while i < 100000000_u64 {
        for &acc in &acc {
            let _ = stdout.write_all(&(acc - 0x8282828282828282).to_le_bytes());
            let _ = stdout.write_all(&[10]);
        }

        for acc in &mut acc {
            *acc += BATCH_SIZE as u64;
            *acc += ((!*acc & 0x8080808080808080) >> 7) * 0xA2;
        }
        i += BATCH_SIZE as u64;
    }
}
#

possibly

#

but that much lowers the chance of the compiler recognizing the autovectorizable code

#

think like a compiler

#

what SIMD is, apply the same math instruction on multiple values at once, basically

#

when the compiler sees this, it's a very easy deduction that those loop iterations can be done in parallel using SIMD instructions

#

however, if those math statements were interleaved with stdout.write_all(), the compiler would first have to undo the interleaving and prove that it is allowed to batch the math instructions together without changing behavior

#

so we precisely do not want to combine the for's

#

to incentivize the autovectorization

#

because i removed the space stripping again

#

because my sense of aesthethic hates its presence in the code

#

you can add the previous code back

#

that keeps track of the length

#

why btw

#

is this not all for fun to beat crunch?

#

i suppose

#

altho i would argue crunch has a slightly different model

#

it doesnt have dynamic lengths

#

you can tell it fixed lengths that it will all go through

#

you just said we have to do the same behavior of crunch

#

now you're implying we have to do better behavior than crunch? :p

#

what?

#

is ammount from your code?

zenith pike
#

Is this for cracking passwords? 😅

#

Just wondering why it was relevant that most people have passwords longer than 10 characters

rapid wadi
#

better = worse in every single way, but faster

#

cant configure it at all

#

everything hardcoded

#

incomprehensible code

#

no help page

#

lol

#

with what even

#

i often need to puzzle together a lot of missing context from your messages

#

you need help with.. itself?

#

what do you mean

#

uh

#

is that a question?

#

is it a statement?

#

oh no

#

stop it

#

ripping out meaning of code by evaluating it manually is the compilers job, not yours :(

#

anyways

#

didnt mean to insult you, it's just kinda frustrating xD

rapid wadi
#

we're batch cycling through 8 strings at a time now

#

so at one point, we are at the 88th string, and the next cycle we're at the 96th string

#

and we skipped the 94th

#

where len would have had to be incremented

#

that ofc a bit rip

#

id restructure the whole code a bit

#

to work more like the crunch model

#

i.e. not have dynamic lengths

#

instead, have a single fixed length

#

but then just run that entire fixed length procedure multiple times, for every length we want

#

yea

#

change batch size

#

to.. 2

#

everything else, 94 is not divisible by

#

mh

#

i suppose

#

you could stop the batching right before reaching the 94

#

switch to a slow path

#

after passing the boundary, switch to fast path again

#

but thats really complciated

#

btw in my benchmarks, 2 was literally slower than 1

#

and batch size 1 is the same as not batching at all

#

🙃

#

the slow path is slow, indeed

#

but the slow path would only be executed for the numbers from 88 to 96, in this example

#

because we need to go one by one in order to catch the 94 threshold

#

all in all, most numbers are still executed with the fast path

#

yeah

#

exactly

#

thats why we should change structure

#

says who

#

its a mii mario

#

to me, static feels better

#

also feels better fitting to the actual task, generating wordlists

#

well in the end the two approaches are equivalent

#

both can do what the other can

#

they are just structured differently

#

and the static one works better in this case i think

#

what

#

please make your question easier to unerstand

#

my understanding aint understandingin

#

what do you wanna do

#

what is there to make work

wet mirageBOT
#
std::io::buffered::bufwriter::BufWriter<W>::write_all_cold:
        push    r15
        push    r14
        push    r12
        push    rbx
        push    rax
        mov     r14, rdx
        mov     r15, rsi
        mov     rbx, rdi
        mov     rax, qword ptr [rdi + 8]
        mov     rcx, rax
        sub     rcx, qword ptr [rdi + 16]
        cmp     rcx, rdx
        jae     .LBB0_3
        mov     rdi, rbx
        call    std::io::buffered::bufwriter::BufWriter<W>::flush_buf
        test    rax, rax
        jne     .LBB0_6
        mov     rax, qword ptr [rbx + 8]
.LBB0_3:
        cmp     rax, r14
        jbe     .LBB0_5
        mov     r12, qword ptr [rbx + 16]
        mov     rdi, qword ptr [rbx]
        add     rdi, r12
        mov     rsi, r15
        mov     rdx, r14
        call    qword ptr [rip + memcpy@GOTPCREL]
        add     r12, r14
        mov     qword ptr [rbx + 16], r12
        xor     eax, eax
        jmp     .LBB0_6
.LBB0_5:
        mov     byte ptr [rbx + 24], 1
        lea     rdi, [rbx + 32]
        mov     rsi, r15
        mov     rdx, r14
        call    qword ptr [rip + <std::io::stdio::StdoutLock as std::io::Write>::write_all@GOTPCREL]
        mov     byte ptr [rbx + 24], 0
.LBB0_6:
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r14
```Note: only public functions (`pub fn`) are shown
Output too large. Godbolt link: <https://godbolt.org/z/6vjhh134G>
#
Crates:
  /crate       Lookup crates on crates.io
  /doc         Lookup documentation
Godbolt:
  ?godbolt     View assembly using Godbolt
  ?mca         Run performance analysis using llvm-mca
  ?llvmir      View LLVM IR using Godbolt
  /targets     Lists all available godbolt rustc targets
Utilities:
  /go          Evaluates Go code
  /source      Links to the bot GitHub repo
  /help        Show this menu
  /uptime      Tells you how long the bot has been up for
  /cleanup     Deletes the bot's messages for cleanup
  /ban         Bans another person
  /selftimeout Self-timeout yourself.
Modmail:
  /modmail     Send a private message to the moderators of the server.
Playground:
  ?play        Compile and run Rust code in a playground
  ?eval        Evaluate a single Rust expression
  ?miri        Run code and detect undefined behavior using Miri
  ?expand      Expand macros to their raw desugared form
  ?clippy      Catch common mistakes using the Clippy linter
  ?fmt         Format code using rustfmt
  ?microbench  Benchmark small snippets of code
  ?procmacro   Compile and use a procedural macro

You can still use all commands with `?`, even if it says `/` above.
Type ?help command for more info on a command.
You can edit your message to the bot and the bot will edit its response.
#
thread 'main' panicked at src/main.rs:10:80:
range end index 17 out of range for slice of length 16
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
E
F
G
H```Output too large. Playground link: <https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=8309171b7f9625f7f44536aa49f6e5a8>
rapid wadi
#

is this an attempt at the static structure?

#

if so, the loop header should be more like for _ in 94.pow(len), since if a single digit has 94 possible values, n digits have 94^n possible values

#

ah ok

#

i forgot what amount is

#

btw the word amount has just a single m

rapid wadi
#

maybe you remember a certain way

#

that i mentioned

#

that includes a word that starts with s and ends with tatic

#

no, that would be super tatic

rapid wadi
#

for any given length, it would generate strings of that length, and those would just not contain spaces

#

in the code, there'd be [..len]

#

cutting off the acc

#

imma be honest, tough luck then

#

i feel like the dynamic thing is kinda doomed

#

the problem is, like

#

there needs to be a check on every loop iteration

#

yk

#

if the length can dynamically change

#

i would guess so

#

yes

#

look, the pseudocode is:

#

dynamic ```
for every value {
generate string
find how many spaces to remove, and remove em
}

static ```
for every string len {
  for every value {
    generate string
    remove spaces (we know how many, because we know len
  }
}
#

i dont care tbh, i just say what i think will have better performance / lend itself to better performance in the future

#

i will venture to bed, maybe ill even stay there for the night, so you might not hear anything from me in the next 8 hours or so

twin moon
#

what does this do?

rapid wadi
#

not a particularly helpful answer tbh
this thread is gigantic

rapid wadi
#

i think a baby got into your computer

#

NONBLOCK is a recipe for disaster

#

in this case

#

probably because it's overwriting buffers that still have to be read, one way or another

#

NONBLOCK makes it inevitable to overwrite pending buffers

#

but even without it, it's still easy to accidentally do it

#

im not sure, but i believe the way to go is:

#
  • decide on a certain buffer size N
  • set the linux pipe buffer size to N*2
  • create two [u8; N] in the program
  • fill them and expose them to the pipe using vmsplice, taking turns (without NONBLOCK!)
#

no

#

there is no way to use nonblock

#

nonblock implies not waiting for the receiving end of the pipe

#

which is cheating

#

not waiting for the receiving end of the pipe is basically discarding data

#

it's like skipping chunks of output

#

why is it not

#

no

#

it will not

#

no

#

do you understand how vmsplice works, how it avoids copying by exposing just a buffer reference to the pipe receiver?

#

yes

#

also, the way i understand it, if the receiving end is fast enough, vmsplice doesn't block even without NONBLOCK

#

so for any performance testing, where the receiving end is always fast enough, it should make no difference

#

improvement of what exactly?

#

im not sure

#

maybe, maybe not

rapid wadi
#

why not?

#

i would think the pure throughput matters, assuming the receiving end of the pipe is coded optimally

#

"simple, performance-focused programming lang" sounds like C tbh :p

#

well depending on what you mean with simple