#Speeding up a BufReader word counter

31 messages · Page 1 of 1 (latest)

fervent flare
#

Hey all,

I'm a JS developer and have been trying my hand at Rust for the past couple of months and would like your honest feedback on what could be improved.

I'm trying my hand at using Neon bindings for Node but in its current state my JS code is faster than the one written in Rust which doesn't really make sense to me so I'm guessing it's more my ineptitude at Rust than anything else.

This is my code:
https://github.com/NullDivision/rust-neon-demo/blob/poc/BasicProject/src/lib.rs

GitHub

Contribute to NullDivision/rust-neon-demo development by creating an account on GitHub.

torn jay
#

also this is probably a case where i'd recommend reading the entire thing into memory at once, if possible

#

it would let you work with references into the data, rather than having to create strings

fervent flare
#

Wow. was not expecting the huge difference in performance between release vs dev builds. Now it's comparable but not much of a difference between JS and Rust

Reading the entire file in would be a problem if i were to try to do that with a large Gb size file but I get the point of the trade off between memory and cycle time

#

Thank you for the "release" suggestion 🙂

torn jay
#

there's a lot of room for improvement

fervent flare
#

by all means, I'm still learning so it would be great to have some insight on what to look into

torn jay
#

Can you refactor this into a function that deals only with rust types, which is then called by a function that handles the javascript stuff?

#

this sort of stuff is a lot easier if you can make a clear boundary

#

(it would also make it much easier to test it)

fervent flare
#

was meaning to do that at some point. you're right. one that would generate the words hashmap and one to generate the output text i'd imagine

torn jay
#

at a glance, i'd suggest something like ```rust
struct MyError;

fn count_words_impl(src: impl Read, dst: impl Write) -> Result<(), MyError>{
...
}

fn count_words(mut cx: FunctionContext) -> JsResult<JsString> {
let filepath = cx.argument::<JsString>(0)?.value(&mut cx);

println!("Counting words in {}", filepath);

match File::open(filepath) {
    Ok(file) => {
        let mut word_count: HashMap<String, u32> = HashMap::new();
        let mut writer = BufWriter::new(File::create("./output.txt").expect("Unable to create file"));
        let reader = BufReader::new(file);
        
        
        count_words_impl(reader, writer)

        Ok(cx.string(""))
    },
    Err(e) => cx.throw_error(e.to_string()),
}

}```

#

there's also a bunch of things like rust writer.write_all(format!( "{:word_length$} | {:count_length$}\n", k, v, word_length = word_length, count_length = count_length ) where you could just use write! and not make intermediate strings

#

and v.to_string().len()

sonic sphinx
#

but still is good practice

sonic sphinx
#

oh right, you can see the js influence with calling for_each :P

fervent flare
#

is for-in a better alternative or is it more a flavor thing? 🙂

torn jay
#

it doesn't really matter

#

that said, you should really be bubbling up that error rather than unwrapping it, which is easier/cleaner in a for .. in loop

#

but this will be easier to review if you make that refactor i suggested above, so please lmk when you have done that ferrisOwO

fervent flare
#

after some poking, prodding, refactoring and remembering to use references instead of moving variables managed to reduce the time considerably. thank you @torn jay 🙂 if you still feel like it and have any more insight to offer, i'd appreciate it

torn jay
#

sure

#

how much faster did you manage to get?

torn jay
#

and instead of rust for (word, count) in &word_count_vec { write!( &mut writer, "{:word_length$} | {:count_length$}\n", word, count, word_length = word_length, count_length = count_length ).expect("Unable to write to file"); } do rust for (word, count) in &word_count_vec { writeln!(&mut writer, "{word:word_length$} | {count:count_length$}")?; }

#

and you can do that everywhere

#

so for example you can write ```rust
fn write_to_file(word_count_vec: Vec<(&str, &u32)>) -> Result<(), Error> {
let mut writer = BufWriter::new(File::create("./output.txt"))?;
let mut word_length = 0;
let mut count_length = 0;

for (word, count) in &word_count_vec {
    word_length = max(word_length, word.len());
    count_length = max(count_length, count.len());
}

writeln!(&mut writer, "Word{:word_length$} | Count{:count_length$}", "", "" )?;

for _ in 0..(word_length + word_length + 3){
    write!(&mut writer, "-")?;
}

writeln!(&mut writer)?;

for (word, count) in &word_count_vec {
    writeln!(&mut writer, "{word:word_length$} | {count:count_length$}")?;
}

writer.flush()?;

Ok(())

}``` which doesn't allocate anything, and will properly handle errors

fervent flare
#

that's a neat trick

  • refactored the writeln to shorthand versions
  • remembered if let exists

managed to get the full call through Neon down to 1.30s. by comparison the Node implementation is around 1.60s for the same 150k lines processed. even on an empty file it's not gonna go faster than .80s due to the Node runtime