#[roast my Rust] Rust function to read and parse multiple markdown files


haughty canopy
#

This works fine when the number of items is small (less than 1000 or so) but once the number of items is larger (above several thousand) it's incredibly slow.

#[tauri::command]
async fn get_all_collection_entries(root_path: &str) -> Result<Vec<serde_json::Value>, String> {
  use std::fs;
  use serde_json::{json};
  
  // https://stackoverflow.com/a/43395633/1971662
  // https://stackoverflow.com/a/26084812/1971662
  // https://www.dotnetperls.com/read-dir-rust

  struct ResultEntry {
    filepath: String,
    content: String
  }

  let mut files = Vec::new();
  for entry_raw in fs::read_dir(root_path).expect("Unable to read file") {
    let entry = entry_raw.expect("error");

    let entry_json: serde_json::Value = json!({
      "path": entry.path().to_str().unwrap(), 
      "name": entry.path().file_name().expect("error").to_str(),
      "ext": entry.path().extension().expect("error").to_str(),
      "stem": entry.path().file_stem().expect("error").to_str(),
      "content": std::fs::read_to_string(entry.path().to_str().unwrap()).expect("Unable to read file")
    });
    
    files.push(entry_json)
  }

  return Ok(files);
}

I'm a total Rust newbie... what am I doing wrong here? Why does this not work with a large number of files? What's a better way to approach this?

sacred mist
#

There’s nothing really wrong with your Rust here as far as I can see

#

I would return a Vec<CollectionEntry> instead of a Vec<serde_json::Value>, since that makes it much clearer what your code does

#

you can derive Serialize for the CollectionEntry struct to keep things tidy
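
A rough sketch of what that struct could look like, assuming the same five fields as the `json!` value in the question. The field types and the `from_path` helper are my own guesses, not anything from the original code. To keep the sketch dependency-free the `Serialize` derive is only mentioned in a comment; in the real Tauri command you would need it on the struct.

```rust
use std::ffi::OsStr;
use std::fs;
use std::path::Path;

/// Hypothetical typed replacement for the ad-hoc `json!` value.
/// With serde available you would also add `#[derive(serde::Serialize)]`
/// so Tauri can send it to the frontend (omitted here to stay std-only).
#[derive(Debug, Clone, PartialEq)]
pub struct CollectionEntry {
    pub path: String,
    pub name: String,
    /// `None` for files without an extension (the original code panics there).
    pub ext: Option<String>,
    pub stem: String,
    pub content: String,
}

/// Converts an optional OS string into an owned `String`, replacing invalid UTF-8.
fn lossy(s: Option<&OsStr>) -> Option<String> {
    s.map(|s| s.to_string_lossy().into_owned())
}

impl CollectionEntry {
    /// Reads one file into an entry, returning an error message instead of panicking.
    pub fn from_path(path: &Path) -> Result<Self, String> {
        let content = fs::read_to_string(path)
            .map_err(|e| format!("cannot read {}: {}", path.display(), e))?;
        Ok(Self {
            path: path.to_string_lossy().into_owned(),
            name: lossy(path.file_name()).ok_or("path has no file name")?,
            ext: lossy(path.extension()),
            stem: lossy(path.file_stem()).ok_or("path has no file stem")?,
            content,
        })
    }
}
```

The command could then return `Result<Vec<CollectionEntry>, String>` and propagate errors with `?` rather than calling `expect`.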

#

your real problem here is independent of any programming language, this function will always be slow no matter what

#

because think about it, you're reading thousands of files into memory: every one of them has to be opened, read, and closed, which is several syscalls to the OS per file

#

and then you’re trying to send that incredibly large collection of stuff to the frontend, where it has to pass through a step of JSON serialization and deserialization as well

#

That’s a bit like trying to send a whole database to the client at once

#

What you should do instead is paging like you would with a database, where the frontend requests only “slices” of the full collection
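
A minimal sketch of that paging idea using only the standard library. The names `list_files` and `read_page` and the offset/limit parameters are made up for illustration: the cheap directory listing happens once, and file contents are read only for the requested slice.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists the files directly inside `root` in a stable (sorted) order,
/// without reading any file contents.
pub fn list_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_file() {
            paths.push(path);
        }
    }
    // read_dir gives no ordering guarantee, so sort to make pages repeatable.
    paths.sort();
    Ok(paths)
}

/// Reads only the files in the window `[offset, offset + limit)`.
/// A window past the end of the list yields a shorter (or empty) page.
pub fn read_page(
    paths: &[PathBuf],
    offset: usize,
    limit: usize,
) -> io::Result<Vec<(PathBuf, String)>> {
    paths
        .iter()
        .skip(offset)
        .take(limit)
        .map(|p| fs::read_to_string(p).map(|content| (p.clone(), content)))
        .collect()
}
```

A Tauri command wrapping this would take `offset` and `limit` from the frontend, so each call only serializes one page instead of the whole collection.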

haughty canopy
#

Thank you, @sacred mist... that's really helpful feedback.

The challenge is that I'm trying to create a spreadsheet-like UI where the frontend needs access to the file content of all files (in a given collection) at the same time... for scrolling large lists like this I'm using virtualized scrolling (e.g. https://github.com/TanStack/virtual) but it's not clear to me what the scalable architecture here is for having access to the actual data...

#

A performance hit to initially load the entries is acceptable... but super high memory usage while using the app isn't

wet pike
#

Well to build on what Jonas mentioned, you can initially load say the first 100, and then read the scroll value of the list, and once the user is say halfway, request the next 50. If you want to be memory conscious, you would release the first 50 at this point too

#

And then you could use a cache like an LRU on the Rust side to avoid repeating the syscalls for the same files over and over again
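
Something like the following could serve as a starting point for such a cache. It is a deliberately simple std-only sketch (the `FileCache` name and its API are invented here); a real app would more likely pull in a dedicated LRU crate and also deal with invalidation when files change on disk.

```rust
use std::collections::{HashMap, VecDeque};
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Minimal least-recently-used cache of file contents (illustrative, not tuned).
pub struct FileCache {
    capacity: usize,
    contents: HashMap<PathBuf, String>,
    /// Cached paths ordered from least to most recently used.
    order: VecDeque<PathBuf>,
}

impl FileCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { capacity, contents: HashMap::new(), order: VecDeque::new() }
    }

    /// Returns the file's content, touching the disk only on a cache miss.
    pub fn get(&mut self, path: &Path) -> io::Result<&str> {
        if self.contents.contains_key(path) {
            // Hit: drop the old position; it is re-added at the back below.
            // (Linear scan, which is fine for a sketch with small capacities.)
            self.order.retain(|p| p.as_path() != path);
        } else {
            // Miss: read from disk, evicting the least recently used entry if full.
            let content = fs::read_to_string(path)?;
            if self.contents.len() >= self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.contents.remove(&oldest);
                }
            }
            self.contents.insert(path.to_path_buf(), content);
        }
        self.order.push_back(path.to_path_buf());
        Ok(self.contents[path].as_str())
    }

    pub fn contains(&self, path: &Path) -> bool {
        self.contents.contains_key(path)
    }

    pub fn len(&self) -> usize {
        self.contents.len()
    }
}
```

Note that a hit returns whatever was cached, even if the file has since changed, so this only makes sense together with some way of noticing changes.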

haughty canopy
sacred mist
#

kinda yes? but they operate using very different constraints (dedicated hardware, not huge files but simple KV pairs etc)

wet pike
sacred mist
#

in your case, if you want to support searching, filtering and stuff like that you should ideally build up a search index of all these files on the Rust side and have the JS query that search index
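
To make the idea concrete, here is a toy inverted index, assuming plain word matching is enough. `SearchIndex`, `add`, and `query` are hypothetical names, and a real implementation would probably reach for a proper full-text search library. The point is only that the frontend sends a query string and gets back a short list of paths, never the file contents.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy inverted index: lowercase word -> set of file paths containing it.
#[derive(Default)]
pub struct SearchIndex {
    postings: HashMap<String, BTreeSet<String>>,
}

/// Splits text on non-alphanumeric characters and lowercases each word.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

impl SearchIndex {
    /// Indexes every word of `content` under `path`.
    pub fn add(&mut self, path: &str, content: &str) {
        for word in tokenize(content) {
            self.postings.entry(word).or_default().insert(path.to_string());
        }
    }

    /// Returns the sorted paths that contain every word of the query (AND semantics).
    /// An empty query matches nothing.
    pub fn query(&self, query: &str) -> Vec<String> {
        let mut result: Option<BTreeSet<String>> = None;
        for word in tokenize(query) {
            let hits = self.postings.get(&word).cloned().unwrap_or_default();
            result = Some(match result {
                None => hits,
                Some(acc) => acc.intersection(&hits).cloned().collect(),
            });
        }
        result.unwrap_or_default().into_iter().collect()
    }
}
```

Building the index still means reading every file once, but that cost is paid on the Rust side and nothing large crosses over to the JS side.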

#

and even better for searching you could spin up a thread that watches the directory for changes and rebuilds that index when necessary
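
A very rough sketch of such a watcher thread. To stay std-only it swaps in plain polling of modification times, which is not what a production app would do (an OS-level file watcher, for example via a crate like `notify`, would be the usual choice). The `snapshot` and `watch` names are made up for illustration.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::{Duration, SystemTime};

/// Last-modified time of every file directly inside a directory.
pub type Snapshot = BTreeMap<PathBuf, SystemTime>;

/// Records the current state of `root` without reading file contents.
pub fn snapshot(root: &Path) -> io::Result<Snapshot> {
    let mut snap = Snapshot::new();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_file() {
            snap.insert(entry.path(), meta.modified()?);
        }
    }
    Ok(snap)
}

/// Takes an initial snapshot, then spawns a thread that polls `root` every
/// `interval` and sends the new snapshot whenever files were added, removed,
/// or modified. The thread exits once a send fails (receiver dropped).
pub fn watch(root: PathBuf, interval: Duration) -> io::Result<Receiver<Snapshot>> {
    let mut last = snapshot(&root)?;
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || loop {
        thread::sleep(interval);
        let current = match snapshot(&root) {
            Ok(snap) => snap,
            Err(_) => continue, // directory unreadable right now; try again later
        };
        if current != last {
            last = current.clone();
            if tx.send(current).is_err() {
                break;
            }
        }
    });
    Ok(rx)
}
```

Whoever owns the search index would sit on the receiving end and rebuild (or incrementally update) the index each time a new snapshot arrives.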