#Rust Abstraction Challenge

lapis galleon
#

Ok I'd like to start a small discussion (it was sparked by something I needed to do at my job), I want to see if it leads to anything interesting.

Suppose you have some company that provides a public API that lets you fetch a bunch of objects in pages (let's say you do a GET HTTP request and it gives you like the next 100 objects in a JSON list), and you need to write a script that has different routines inside of it that might process these objects differently, and you can choose one of them at startup. To clarify, all of them need to process the entirety of the objects that the API provides, so the routine needs to read all pages and process them continuously. It could process it in various ways: you could imagine that the objects have relations to each other and that the routine is constructing some graph, or that the routine is filtering some important information needed for something you'll do later, or whatever else. A routine might keep track of some state while it's processing these objects to do this - for example you could have a very simple routine that just counts the number of objects that it has fetched.

How would you abstract this so that you can implement different routines operating on the same requests?
(in Rust btw - may as well stay on one language xd)

hardy socket
#

whats stopping you from like just... passing the json objects as a Vec in a function's arguments?

#

like, i dont see how this is more complex than literally any basic polymorphism

lapis galleon
#

ok yeah maybe I shouldn't have said 100 that's too small lol

nocturne canopy
#

an Iterator that yields those objects and does pagination stuff?

hardy socket
#

heck i'd worry if it starts not fitting the Vec in RAM

lapis galleon
#

let's say 1 million to force streaming into the mix

#

or 1 billion whatever

lament fulcrum
#

so an un-paginated JSON API p much

hardy socket
#

i mean, gimme the details on why not, otherwise this is pretty pointless idle guessing

lapis galleon
#

Because I'm talking about a routine that needs to operate on all the pages that get fetched and do something for the entirety of the data

#

If you want to have a function that does something for one of these pages then yeah it's pretty much just a function that takes a vec

hardy socket
#

yes if your routines need that you need to provide it all the data it needs. simplest is collecting them all into a vec and passing it along.

lapis galleon
#

What I'm talking about is not a routine that's just a function that does something on each page, but something that processes each page and may keep some state and do a summary calculation at the end.
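
A minimal sketch of that kind of routine, assuming the fetched data is exposed as an iterator of objects. `Object`, its fields, and the routine names are hypothetical and only stand in for whatever the real API returns. Each routine keeps its state in local variables and returns its summary at the end:

```rust
use std::collections::BTreeMap;

// Hypothetical object type standing in for one decoded JSON object.
#[derive(Debug, Clone, PartialEq)]
struct Object {
    id: u64,
    kind: String,
}

// The simplest routine from the question: count everything that was fetched.
// The state is a local variable, the summary is the return value.
fn count_objects(objects: impl Iterator<Item = Object>) -> usize {
    let mut count = 0;
    for _object in objects {
        count += 1;
    }
    count
}

// A routine with slightly richer state: a tally of objects per `kind`.
fn count_by_kind(objects: impl Iterator<Item = Object>) -> BTreeMap<String, usize> {
    let mut tally = BTreeMap::new();
    for object in objects {
        *tally.entry(object.kind).or_insert(0) += 1;
    }
    tally
}

// Choosing a routine at startup is a plain match on a (hypothetical) name.
fn run(routine: &str, objects: impl Iterator<Item = Object>) -> String {
    match routine {
        "count" => format!("{} objects", count_objects(objects)),
        "by-kind" => format!("{:?}", count_by_kind(objects)),
        other => format!("unknown routine: {}", other),
    }
}
```

Written this way, no routine needs to know about pages or HTTP; it only sees a stream of objects, and it never needs the whole data set in memory at once.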

hardy socket
#

and i'm asking "can you fetch all pages, stuff em into a Vec, and give your function that?"

#

and because the answer is most likely "no": why? what limits am i working with here?

weak dirge
#

i keep getting jumpscared by the word abstraction on this server bc i have an oc named abstraction lol

hardy socket
#

these routines, whats their purpose?

#

because like yes, if you just want a paginated api call implement Iterator such that each .next returns a page, thats simple. i gotta be missing something which makes you ask here like this
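
Roughly what that could look like, as a sketch rather than a real client: `Pages`, `fetch_page`, and `fake_fetch` are made-up names, the closure stands in for the actual GET request, and an empty page is assumed to mean "no more data" (a real API may signal the end differently):

```rust
// Iterator over pages. `fetch_page` is a hypothetical stand-in for the HTTP
// call: it receives the offset of the next object and returns that page.
struct Pages<F> {
    fetch_page: F,
    offset: usize,
    done: bool,
}

impl<F> Pages<F>
where
    F: FnMut(usize) -> Vec<u64>,
{
    fn new(fetch_page: F) -> Self {
        Pages { fetch_page, offset: 0, done: false }
    }
}

impl<F> Iterator for Pages<F>
where
    F: FnMut(usize) -> Vec<u64>,
{
    type Item = Vec<u64>;

    fn next(&mut self) -> Option<Vec<u64>> {
        if self.done {
            return None;
        }
        let page = (self.fetch_page)(self.offset);
        if page.is_empty() {
            // Assumption: an empty page marks the end of the data.
            self.done = true;
            return None;
        }
        self.offset += page.len();
        Some(page)
    }
}

// Fake API for demonstration only: 250 objects, served 100 at a time.
fn fake_fetch(offset: usize) -> Vec<u64> {
    let total = 250usize;
    let end = (offset + 100).min(total);
    (offset..end).map(|i| i as u64).collect()
}
```

`Pages::new(fake_fetch)` yields one page per `.next()` call, and `Pages::new(fake_fetch).flatten()` turns that into an iterator over individual objects. A real version would likely yield `Result<Vec<_>, SomeError>` so that network failures reach the caller.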

lapis galleon
#

Heh... Maybe I just have a skill issue. Because turning the fetching process into an iterator is only something I came up with a day later after thinking that I needed to turn the routines into a trait and that it was probably a bad idea.

hardy socket
#

but like, higher level consideration: does it make sense to fetch that much data through an API?

lapis galleon
#

It made sense in my case yeah

#

(I had no choice lol)

hardy socket
#

so is this like, one time or needed constantly? how up-to-date does the data need to be

lapis galleon
#

In this case the data is fetched one time but it could be re-fetched later. Though in the context of my job eventually i'll just need to implement a separate routine that fetches the changes since a previous date instead.

#

anyways

#

Unrelated to my job now, but see in lokinit for instance, the API we have currently to get events from the window is one that's as simple as poll_event() -> Option<Event>. We're not as cross-platform as winit but BrightShard has made a lot of effort on MacOS so that we can have this API, because to manage the events of a window, MacOS "only" provides an API where you have to give it a class that has a method for each event that'll get called when these events come up, and where you give it control over the control flow of your application. Winit does the same, it has an EventLoop type and you have to call .run(|| closure-to-match-on-events-here) and give up control flow so that the underlying API (or Winit) does it.

#

It's honestly a pain to deal with an API like this as opposed to one where you poll events. It's no longer this simple thing where you have like, variables at the top, then some while loop that modifies them at each iteration, and then you exit by breaking. No instead it has to be this whole complicated thing where you first put the state in a struct that you modify in the individual methods — and when it comes to Rust, it might not even work because of some limitations of the borrow checker.
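
To make the contrast concrete, here is a toy sketch of both styles over the same event queue. Everything in it is hypothetical (`Window`, `Event`, `EventHandler`, `App`); it is not lokinit's or winit's actual API, and it simplifies by treating `None` from `poll_event` as "no more events":

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    KeyPress(char),
    CloseRequested,
}

// Hypothetical event source backed by a plain queue.
struct Window {
    queue: VecDeque<Event>,
}

// Callback style: one method per event, called by whoever owns the loop.
trait EventHandler {
    fn on_key_press(&mut self, key: char);
    fn on_close(&mut self);
}

impl Window {
    // Poll style: the caller owns the loop and asks for the next event.
    fn poll_event(&mut self) -> Option<Event> {
        self.queue.pop_front()
    }

    // Callback style: the window owns the loop and calls into the handler.
    fn run(mut self, handler: &mut dyn EventHandler) {
        while let Some(event) = self.queue.pop_front() {
            match event {
                Event::KeyPress(c) => handler.on_key_press(c),
                Event::CloseRequested => {
                    handler.on_close();
                    return;
                }
            }
        }
    }
}

// With callbacks, the state has to live in a struct the methods can reach.
struct App {
    typed: String,
    closed: bool,
}

impl EventHandler for App {
    fn on_key_press(&mut self, key: char) {
        self.typed.push(key);
    }
    fn on_close(&mut self) {
        self.closed = true;
    }
}

// With polling, the state is a local variable and exiting is just `break`.
fn poll_style(mut window: Window) -> String {
    let mut typed = String::new();
    while let Some(event) = window.poll_event() {
        match event {
            Event::KeyPress(c) => typed.push(c),
            Event::CloseRequested => break,
        }
    }
    typed
}
```

Both versions collect the same keys, but in the callback version the "exit" decision and the state are spread across `App` and `Window::run`, while in the polling version they sit next to each other in one function.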

#

My experience with an API that basically boils down to an iterator has pretty much always been better, but I feel like I'm seeing a potentially more general pattern of abstraction in all of this. One where you're supposed to let the user stay in control of the control flow if your abstraction happens to have a somewhat complex one.
I thought I had a good example there but apparently not. It's just iterator and that's it. I wish I had a better example where people would obviously implement it as traits and I could somehow find a way to implement it as a... uh... state machine? idk :/
(and then compare the two to evaluate the pros and cons etc.)

hardy socket
#

i mean, you already mentioned polling which is essentially just calling .next manually

#

idk what more general abstraction here could be

#

if you truly have that much data i'd be much more interested in handling this efficiently, maybe storing it locally as parquet and streaming over it with polars. or even some of the bigger data-engineering stuff out there which i have no clue about

lapis galleon
#

If you want the actual concrete use-case, I'm fetching a French database that contains public data about all healthcare practitioners / roles / organizations, and I have to do some data processing on them so that people can quickly search for professionals and organizations and a bunch of other stuff that's specific to our startup

#

I end up with data that spans almost 10 million rows in one table - not really billions but it's still big

hardy socket
#

yeah idk, i'd just stuff it into polars and see what happens

#

(low cost of failure high cost of planning)

#

10 million rows sounds right around the sweet spot where it could shine

lapis galleon
#

You mentioned doing it locally instead - well that's what I did today actually, I wrote a routine that just saves all the bundles I get into a file that has one big json blob per line and then I just have a second iterator that iterates over these lines if I want to do it locally. serde_json has a handy StreamDeserializer just for that so it was like 2 lines of code to implement
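
A std-only sketch of the "one blob per line" idea, for illustration. It swaps in `BufRead::lines` where the message above uses serde_json's `StreamDeserializer`, and it leaves the JSON parsing itself out; `save_bundles` and `read_bundles` are hypothetical names, and each bundle is assumed to be already-serialized JSON text with no raw newlines in it:

```rust
use std::io::{self, BufRead, Write};

// Save each bundle as one line of the output.
// Assumption: a bundle is compact JSON text containing no raw newline.
fn save_bundles<W: Write>(
    mut out: W,
    bundles: impl Iterator<Item = String>,
) -> io::Result<()> {
    for bundle in bundles {
        writeln!(out, "{}", bundle)?;
    }
    Ok(())
}

// Replay the saved bundles lazily, one line at a time, so the file never
// has to fit in memory. Each item would then be handed to a JSON parser.
fn read_bundles<R: BufRead>(input: R) -> impl Iterator<Item = io::Result<String>> {
    input.lines()
}
```

Because `read_bundles` is again just an iterator, the same routines can run against the saved file or against the live API without changes. In practice `out` would be a `BufWriter<File>` and `input` a `BufReader<File>`.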

hardy socket
#

yeah with Polars it'd be something similar: you'd stuff it in a parquet file, and use the streaming API to apply dataframe operations on it

lapis galleon
#

I'm glad I can just use the StreamDeserializer then xD

hardy socket
#

plus you'd get the advantages of polars optimizing your operations kinda like a db would optimise a query, and executing them multithreaded

south iron
#

I thought of Streams immediately when reading this xd

#

I love them, makes working with async so much better