I've made a small web scraper, mostly as an exercise in doing it since it's my first attempt at such a thing. There's a few things I don't like about the structure - waiting for 1 second after each call seems arbitrary and I should probably be using async somewhere instead of pinging each chapter in order.
https://github.com/Pale-Vessel/ao3_chapter_lengths
#Code review for small scraping project
13 messages · Page 1 of 1 (latest)
Hi @regal pelican
ao3_chapter_lengths Code Review
Cargo.toml
Add metadata - Consider including optional but useful fields
[package]
name = "ao3_chapter_lengths"
version = "0.1.0"
edition = "2021"
authors = ["Your Name <[email protected]>"]
license = "MIT OR Apache-2.0" # Common Rust convention
description = "Tool to analyze chapter lengths on AO3"
repository = "https://github.com/username/ao3_chapter_lengths"
main.rs
Liberal use of panic! and expect()
Your code will crash on many valid inputs:
- Invalid work IDs
- Network failures
- Malformed HTML
- Works with no chapters
Better approach: Use Result types and proper error handling.
You can use the anyhow crate, to simplify and streamline errors.
Incorrect curl flag
-http2 should be --http2 (double dash). This would silently fail or cause errors.
Fragile URL parsing
match &id_input.get(..5) {
Some("https") => // ...
Some("archi") => // ...
- Checking only 5 characters is unreliable
- Doesn't handle
http://(no 's') - Magic indices (
.nth(4),.nth(2)) will break with different URL formats
Commented-out code
//.skip(1)
Remove this or document why it's there.
Confusing output format
_ => println!("{:?} {}", &lengths[1..], lengths[0]),
This prints chapters 2+ first, then chapter 1 separately. Confusing... Why?
This should be documented in a comment.
Raw string unnecessary
let url = format!(r"https://archiveofourown.org/works/{work_id}?view_full_work=true");
The r prefix is pointless here (no escape sequences to avoid).
Standard quotes would be more idiomatic unless you plan on adding complex regex or Windows file paths to the string later.
README.md and LICENSE.md
Readme and License are missing. It would help understanding the project and its rights to use.
Can I just ask, was this response generated by an LLM?
Sorry if it wasn't, just some stuff about the phrasing feels iffy
I did one manual and asked LLM as well
So 2 in one
I review LLM as well
I usually have my points and a subset of the LLM
Alright then
Addressing your points in order
- that is a good point and I'll add some metadata
- most of the crashes are intentional: if the user inputs an invalid ID there's no meaningful recovery that can be done; works with no chapters can't be uploaded so that always indicates a failure; if there's no network again nothing can be done. Malformed html could be checked to look for some errors but on the whole that just won't arise due to how well structured ao3 html is.
- good spot, I didn't catch that incorrect flag
- again good spot, I forgot some bowsers still use http. For the other points, even checking five characters is overkill tbh, since it's only designed to work with those three kinds of link - I don't think ao3 even has any other way of linking to a work. I could trim off
view_full_workif it's present tho - the next two points go together, confusingly the first "chapter" is actually the full work which is why it's printed separately and why I was originally skipping it.
- another good spot, that's an artefact from when I was creating a more complex formatted string
- I should add a readme and license, you're right