HTML manipulation library | Rust Programming Language Community | Page 1

elder wigeon Aug 16, 2022, 11:21 PM

I'm looking for an HTML manipulation library for Rust. I want to be able to deserialize an HTML document, select nodes, modify them, and finally serialize it back.
I found html5ever, which I assume is more than capable of doing this, but it calls itself "browser-grade", which in practice means "complicated". The documentation isn't much help to someone like me who doesn't know everything there is to know about HTML5 and related standards, and there are almost no examples (and definitely none that clearly explain what they are doing). Moreover, the part that (I think) has an API for what I'm trying to do (markup5ever_rcdom) is, in the words of its creators:

[...] built for the express purpose of writing automated tests for the html5ever and xml5ever crates. It is not intended to be a production-quality DOM implementation, and has not been fuzzed or tested against arbitrary, malicious, or nontrivial inputs. No maintenance or support for any such issues will be provided. If you use this DOM implementation in a production, user-facing system, you do so at your own risk.
Long story short, it doesn't seem suited to my use case.
There is also scraper, which I have used before and liked, but it can only read HTML, not modify it.
I also found nipper. It looks like what I need, but it also looks like a dead project. I have nothing against a year without updates for a library that has already hit 1.0 and is a finished, maintenance-only product, but unstable (0.x) libraries that aren't actively developed are a sign of lost interest.
Maybe someone knows a good library that would satisfy my needs, or at least has a good resource on how to do what I want with html5ever, without using stuff I'm not supposed to be using.

novel dawn Aug 16, 2022, 11:31 PM

Maybe typed-html? It's getting kind of old but should still work fine.

You could also see if an xml crate will work for you

elder wigeon Aug 16, 2022, 11:37 PM

novel dawn Maybe `typed-html`? It's getting kind of old but should still work fine.

I'm looking through the docs and it seems to be for creating new documents, not changing existing ones (it doesn't seem to have a function for deserializing HTML, and that's also how it presents itself in its description).
It also has the same problem nipper has: it's unmaintained and not finished (0.x).

elder wigeon Aug 16, 2022, 11:40 PM

novel dawn You could also see if an xml crate will work for you

While I'm not well-versed in web development, I realise that XML and HTML are very different, despite looking similar.
For example, the following:

<meta charset="utf-8">

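...is valid HTML, but isn't valid XML (the tag isn't closed). In HTML, meta is a void element that never takes a closing tag, whereas XML requires every element to be closed. So an XML parser should fail here, even though it's perfectly valid HTML.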
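
To make the difference concrete, here is a rough std-only sketch of a tag-balance check. It is a toy, not a real parser: `unclosed_tags` and its `html_rules` flag are names made up for this illustration, and it ignores comments, quoted `>` characters, and everything else a real HTML or XML parser has to deal with. The only thing it models is that HTML treats a fixed set of void elements as complete without a closing tag.

```rust
/// HTML void elements: they never take a closing tag.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
];

/// Toy check (hypothetical helper, not a real parser): returns the names of
/// the tags that are still open when `input` ends. With `html_rules` set,
/// void elements count as complete without a closing tag.
fn unclosed_tags(input: &str, html_rules: bool) -> Vec<String> {
    let mut open: Vec<String> = Vec::new();
    let mut rest = input;

    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        let end = match after.find('>') {
            Some(i) => i,
            None => break, // no closing '>' left, stop scanning
        };
        let tag = &after[..end];
        rest = &after[end + 1..];

        // Skip doctype declarations, comments and processing instructions.
        if tag.starts_with('!') || tag.starts_with('?') {
            continue;
        }

        if let Some(name) = tag.strip_prefix('/') {
            // Closing tag: pop the open tag if it matches the top of the stack.
            let name = name.trim().to_ascii_lowercase();
            if open.last() == Some(&name) {
                open.pop();
            }
        } else if !tag.ends_with('/') {
            // Opening tag that is not self-closed with "/>".
            let name = tag
                .split_whitespace()
                .next()
                .unwrap_or("")
                .to_ascii_lowercase();
            let is_void = html_rules && VOID_ELEMENTS.contains(&name.as_str());
            if !name.is_empty() && !is_void {
                open.push(name);
            }
        }
    }
    open
}

fn main() {
    let snippet = r#"<meta charset="utf-8">"#;
    // Under XML-style rules the meta tag is left dangling.
    println!("XML rules:  {:?}", unclosed_tags(snippet, false));
    // Under HTML rules meta is a void element, so nothing is left open.
    println!("HTML rules: {:?}", unclosed_tags(snippet, true));
}
```

Under the XML-style rules the snippet only balances if it is written in self-closed form, `<meta charset="utf-8"/>`, which is not how HTML found in the wild is usually written.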

So I'm definitely looking for something HTML-specific.

novel dawn Aug 16, 2022, 11:41 PM

This one has an example with HTML: https://crates.io/crates/quick-xml

elder wigeon Aug 16, 2022, 11:44 PM

Interesting.
