Rust Abstraction Challenge | NAMTAO Productions | Page 1

lapis galleon Jun 20, 2024, 4:58 PM

#

Ok I'd like to start a small discussion (it was sparked by something I needed to do at my job), I want to see if it leads to anything interesting.

Suppose you have some company that provides a public API that lets you fetch a bunch of objects in pages (let's say you do a GET HTTP request and it gives you like the next 100 objects in a JSON list), and you need to write a script that has different routines inside of it that might process these objects differently, and you can choose one of them at startup. To clarify, all of them need to process the entirety of the objects that the API provides, so the routine needs to read all pages and process them continuously. It could process it in various ways: you could imagine that the objects have relations to each other and that the routine is constructing some graph, or that the routine is filtering some important information needed for something you'll do later, or whatever else. A routine might keep track of some state while it's processing these objects to do this - for example you could have a very simple routine that just counts the number of objects that it has fetched.

How would you abstract this so that you can implement different routines operating on the same requests?
(in Rust btw - may as well stay on one language xd)

hardy socket Jun 20, 2024, 4:59 PM

#

whats stopping you from like just... passing the json objects as a Vec in a function's arguments?

#

like, i dont see how this is more complex than literally any basic polymorphism

lapis galleon Jun 20, 2024, 5:00 PM

#

ok yeah maybe I shouldn't have said 100 that's too small lol

nocturne canopy Jun 20, 2024, 5:00 PM

#

an Iterator that yields those objects and does pagination stuff?

hardy socket Jun 20, 2024, 5:00 PM

#

heck i'd worry if it starts not fitting the Vec in RAM

lapis galleon Jun 20, 2024, 5:00 PM

#

let's say 1 million to force streaming into the mix

#

or 1 billion whatever

hardy socket Jun 20, 2024, 5:00 PM

#

hardy socket heck i'd worry if it starts not fitting the Vec in RAM

.

hardy socket Jun 20, 2024, 5:01 PM

#

lapis galleon or 1 billion whatever

that still potentially fits in ram

lament fulcrum Jun 20, 2024, 5:01 PM

#

so an un-paginated JSON API p much

lament fulcrum Jun 20, 2024, 5:02 PM

#

hardy socket that still potentially fits in ram

but do you want to fit all of that into ram?

lapis galleon Jun 20, 2024, 5:02 PM

#

hardy socket whats stopping you from like just... passing the json objects as a Vec in a func...

Ok hold on. Do you mean one page?

hardy socket Jun 20, 2024, 5:02 PM

#

i mean, gimme the details on why not, otherwise this is pretty pointless idle guessing

lapis galleon Jun 20, 2024, 5:02 PM

#

Because I'm talking about a routine that needs to operate on all the pages that get fetched and do something for the entirety of the data

#

If you want to have a function that does something for one of these pages then yeah it's pretty much just a function that takes a vec

hardy socket Jun 20, 2024, 5:03 PM

#

yes if your routines need that you need to provide it all the data it needs. simplest is collecting them all into a vec and passing it along.

lapis galleon Jun 20, 2024, 5:04 PM

#

What I'm talking about is not a routine that's just a function that does something on each page, but something that processes each page and may keep some state and do a summary calculation at the end.
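
One way to read "processes each page, may keep some state, and does a summary calculation at the end" is as a trait with a per-page method and a finishing method. The sketch below is only an illustration: `Object`, `Routine`, `CountObjects` and `run_routine` are hypothetical names that do not come from the discussion, and `Object` stands in for whatever the API's JSON would deserialize into.

```rust
/// Hypothetical stand-in for one object returned by the API.
#[derive(Debug, Clone)]
pub struct Object {
    pub id: u64,
}

/// A routine sees every page in order and produces a summary at the end.
pub trait Routine {
    type Summary;

    /// Called once per fetched page; may update internal state.
    fn process_page(&mut self, page: &[Object]);

    /// Called after the last page; consumes the routine.
    fn finish(self) -> Self::Summary;
}

/// The simplest routine mentioned above: count the fetched objects.
#[derive(Default)]
pub struct CountObjects {
    count: usize,
}

impl Routine for CountObjects {
    type Summary = usize;

    fn process_page(&mut self, page: &[Object]) {
        self.count += page.len();
    }

    fn finish(self) -> usize {
        self.count
    }
}

/// The driver owns the loop: it pulls pages and pushes them into the routine.
pub fn run_routine<R: Routine>(
    mut routine: R,
    pages: impl IntoIterator<Item = Vec<Object>>,
) -> R::Summary {
    for page in pages {
        routine.process_page(&page);
    }
    routine.finish()
}
```

With this shape only one page needs to be in memory at a time, but every routine has to move its state into a struct, and the driver, not the routine, decides when the loop runs.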

hardy socket Jun 20, 2024, 5:05 PM

#

and i'm asking "can you fetch all pages, stuff em into a Vec, and give your function that?"

#

and because the answer is most likely "no": why? what limits am i working with here?

weak dirge Jun 20, 2024, 5:06 PM

#

i keep getting jumpscared by the word abstraction on this server bc i have an oc named abstraction lol

lapis galleon Jun 20, 2024, 5:06 PM

#

hardy socket and i'm asking "can you fetch all pages, stuff em into a Vec, and give your func...

no that's too much

lapis galleon Jun 20, 2024, 5:06 PM

#

hardy socket and because the answer is most likely "no": why? what limits am i working with h...

Imagine billions of objects that don't fit in ram

hardy socket Jun 20, 2024, 5:06 PM

#

lapis galleon no that's too much

aight so we're talking properly big scale

#

these routines, whats their purpose?

#

because like yes, if you just want a paginated api call implement Iterator such that each .next returns a page, thats simple. i gotta be missing something which makes you ask here like this
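
A minimal sketch of the "implement Iterator such that each .next returns a page" idea. Everything named here is an assumption for illustration: `Pages`, `Object`, `fake_fetch` and `count_objects` are hypothetical, and the `fetch` closure stands in for the real HTTP GET, with an empty page taken to mean "no more data".

```rust
/// Hypothetical stand-in for one object returned by the API.
#[derive(Debug, Clone, PartialEq)]
pub struct Object {
    pub id: u64,
}

/// Iterator over pages. `fetch` stands in for the HTTP request: given a page
/// index it returns that page, or an empty Vec once the data is exhausted.
pub struct Pages<F> {
    fetch: F,
    next_page: usize,
    done: bool,
}

impl<F: FnMut(usize) -> Vec<Object>> Pages<F> {
    pub fn new(fetch: F) -> Self {
        Pages { fetch, next_page: 0, done: false }
    }
}

impl<F: FnMut(usize) -> Vec<Object>> Iterator for Pages<F> {
    type Item = Vec<Object>;

    fn next(&mut self) -> Option<Vec<Object>> {
        if self.done {
            return None;
        }
        let page = (self.fetch)(self.next_page);
        if page.is_empty() {
            // Remember the end so we never call `fetch` again.
            self.done = true;
            return None;
        }
        self.next_page += 1;
        Some(page)
    }
}

/// Fake backend for the example: `total` objects served `page_size` at a time.
pub fn fake_fetch(total: u64, page_size: u64) -> impl FnMut(usize) -> Vec<Object> {
    move |page| {
        let start = page as u64 * page_size;
        let end = (start + page_size).min(total);
        (start..end).map(|id| Object { id }).collect()
    }
}

/// A routine is now ordinary code: the caller owns the loop and the state.
pub fn count_objects(pages: impl Iterator<Item = Vec<Object>>) -> usize {
    let mut count = 0;
    for page in pages {
        count += page.len();
    }
    count
}
```

For instance, `count_objects(Pages::new(fake_fetch(250, 100)))` walks three pages and holds only one of them in memory at any time. A real version would probably yield `Result<Vec<Object>, E>` so that network errors can reach the caller.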

lapis galleon Jun 20, 2024, 5:10 PM

#

Heh... Maybe I just have a skill issue. Because turning the fetching process into an iterator is only something I came up with a day later after thinking that I needed to turn the routines into a trait and that it was probably a bad idea.

hardy socket Jun 20, 2024, 5:12 PM

#

but like, higher level consideration: does it make sense to fetch that much data through an API?

lapis galleon Jun 20, 2024, 5:12 PM

#

It made sense in my case yeah

#

(I had no choice lol)

hardy socket Jun 20, 2024, 5:13 PM

#

so is this like, one time or needed constantly? how up-to-date does the data need to be

lapis galleon Jun 20, 2024, 5:14 PM

#

In this case the data is fetched one time but it could be re-fetched later. Though in the context of my job eventually i'll just need to implement a separate routine that fetches the changes since a previous date instead.

#

anyways

#

Unrelated to my job now, but see in lokinit for instance, the API we have currently to get events from the window is one that's as simple as poll_event() -> Option<Event>. We're not as cross-platform as winit but BrightShard has made a lot of effort on macOS so that we can have this API, because to manage the events of a window, macOS "only" provides an API where you have to give it a class that has a method for each event that'll get called when these events come up, and where you give it control over the control flow of your application. Winit does the same, it has an EventLoop type and you have to call .run(|| closure-to-match-on-events-here) and give up control flow so that the underlying API (or Winit) does it.

#

It's honestly a pain to deal with an API like this as opposed to one where you poll events. It's no longer this simple thing where you have like, variables at the top, then some while loop that modifies them at each iteration, and then you exit by breaking. No, instead it has to be this whole complicated thing where you first put the state in a struct that you modify in the individual methods, and when it comes to Rust, it might not even work because of some limitations of the borrow checker.

#

My experience with an API that basically boils down to an iterator has pretty much always been better, but I feel like I'm seeing a potentially more general pattern of abstraction in all of this. One where you're supposed to let the user stay in control of the control flow if your abstraction happens to have a somewhat complex one.
I thought I had a good example there but apparently not. It's just iterator and that's it. I wish I had a better example where people would obviously implement it as traits and I could somehow find a way to implement it as a... uh... state machine? idk :/
(and then compare the two to evaluate the pros and cons etc.)
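
To make the comparison concrete, here is a rough sketch of the same tiny task (count key presses until a quit event) written against a pull-style and a push-style event source. `Window`, `Event`, `EventHandler` and `KeyCounter` are made-up types for this example, not the actual lokinit or winit API.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    KeyPress(char),
    Quit,
}

/// Hypothetical event source backed by a queue of pre-recorded events.
pub struct Window {
    queue: VecDeque<Event>,
}

/// Push style: the handler's state has to live in a type implementing this.
pub trait EventHandler {
    /// Returns `false` to ask the event source to stop.
    fn on_event(&mut self, event: Event) -> bool;
}

impl Window {
    pub fn with_events(events: &[Event]) -> Self {
        Window { queue: events.iter().copied().collect() }
    }

    /// Pull style: the caller asks for the next event when it wants one.
    pub fn poll_event(&mut self) -> Option<Event> {
        self.queue.pop_front()
    }

    /// Push style: the source owns the loop and calls into the handler.
    pub fn run(mut self, handler: &mut dyn EventHandler) {
        while let Some(event) = self.queue.pop_front() {
            if !handler.on_event(event) {
                break;
            }
        }
    }
}

/// Pull version: state is a local variable, exit is a plain `break`.
pub fn count_keys_polling(mut window: Window) -> usize {
    let mut keys = 0;
    while let Some(event) = window.poll_event() {
        match event {
            Event::KeyPress(_) => keys += 1,
            Event::Quit => break,
        }
    }
    keys
}

/// Push version: the same state, moved into a struct the callback mutates.
#[derive(Default)]
pub struct KeyCounter {
    pub keys: usize,
}

impl EventHandler for KeyCounter {
    fn on_event(&mut self, event: Event) -> bool {
        match event {
            Event::KeyPress(_) => {
                self.keys += 1;
                true
            }
            Event::Quit => false,
        }
    }
}
```

Both versions compute the same number. The difference is where the loop and the state live: in the caller's function for the pull version, and split between `Window::run` and `KeyCounter` for the push version.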

hardy socket Jun 20, 2024, 5:30 PM

#

i mean, you already mentioned polling which is essentially just calling .next manually

#

idk what more general abstraction here could be

#

if you truly have that much data i'd be much more interested in handling this efficiently, maybe storing it locally as parquet and streaming over it with polars. or even some of the bigger data-engineering stuff out there which i have no clue about

lapis galleon Jun 20, 2024, 5:35 PM

#

If you want the actual concrete use-case, I'm fetching a French database that contains public data about all healthcare practitioners / roles / organizations, and I have to do some data processing on them so that people can quickly search for professionals and organizations and a bunch of other stuff that's specific to our startup

#

I end up with data that spans almost 10 million rows in one table - not really billions but it's still big

hardy socket Jun 20, 2024, 5:36 PM

#

yeah idk, i'd just stuff it into polars and see what happens

#

(low cost of failure high cost of planning)

#

10 million rows sounds right around the sweet spot where it could shine

lapis galleon Jun 20, 2024, 5:38 PM

#

You mentioned doing it locally instead - well that's what I did today actually, I wrote a routine that just saves all the bundles I get into a file that has one big json blob per line and then I just have a second iterator that iterates over these lines if I want to do it locally. serde_json has a handy StreamDeserializer just for that so it was like 2 lines of code to implement
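
The save-then-replay idea can be sketched with the standard library alone. This is not the serde_json `StreamDeserializer` approach described above: it only shows the one-blob-per-line file as a lazy iterator of raw lines and leaves the JSON parsing of each line out. `save_page` and `replay_pages` are hypothetical names, and the sketch assumes each blob is compact JSON with no raw newline inside it.

```rust
use std::io::{self, BufRead, Write};

/// Append one already-serialized page as a single line.
/// Assumes `json_blob` contains no raw newline (compact JSON never does,
/// because newlines inside JSON strings are escaped).
pub fn save_page<W: Write>(out: &mut W, json_blob: &str) -> io::Result<()> {
    debug_assert!(!json_blob.contains('\n'));
    writeln!(out, "{}", json_blob)
}

/// Lazily replay the saved blobs, one per line, skipping blank lines.
/// I/O errors are passed through to the caller instead of being hidden.
pub fn replay_pages<R: BufRead>(input: R) -> impl Iterator<Item = io::Result<String>> {
    input.lines().filter(|line| match line {
        Ok(text) => !text.trim().is_empty(),
        Err(_) => true,
    })
}
```

Because `replay_pages` is just another iterator, a routine written against an iterator of pages can run against the local file or the live API without changes, once each line is parsed into the same page type.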

hardy socket Jun 20, 2024, 5:41 PM

#

yeah with Polars it'd be something similar: you'd stuff it in a parquet file, and use the streaming API to apply dataframe operations on it

lapis galleon Jun 20, 2024, 5:41 PM

#

I'm glad I can just use the StreamDeserializer then xD

hardy socket Jun 20, 2024, 5:47 PM

#

plus you'd get the advantages of polars optimizing your operations kinda like a db would optimise a query, and executing them multithreaded

south iron Jun 21, 2024, 2:36 AM

#

I thought of Streams immediately when reading this xd

#

I love them, makes working with async so much better
