#Code review for small scraping project

13 messages · Page 1 of 1 (latest)

regal pelican
#

I've made a small web scraper, mostly as an exercise in doing it since it's my first attempt at such a thing. There's a few things I don't like about the structure - waiting for 1 second after each call seems arbitrary and I should probably be using async somewhere instead of pinging each chapter in order.
https://github.com/Pale-Vessel/ao3_chapter_lengths

GitHub

Contribute to Pale-Vessel/ao3_chapter_lengths development by creating an account on GitHub.

toxic shoal
#

Hi @regal pelican

ao3_chapter_lengths Code Review

Cargo.toml

Add metadata - Consider including optional but useful fields

[package]
name = "ao3_chapter_lengths"
version = "0.1.0"
edition = "2021"
authors = ["Your Name <[email protected]>"]
license = "MIT OR Apache-2.0"  # Common Rust convention
description = "Tool to analyze chapter lengths on AO3"
repository = "https://github.com/username/ao3_chapter_lengths"

main.rs

Liberal use of panic! and expect()

Your code will crash on many valid inputs:

  • Invalid work IDs
  • Network failures
  • Malformed HTML
  • Works with no chapters

Better approach: Use Result types and proper error handling.

You can use the anyhow crate, to simplify and streamline errors.

Incorrect curl flag

-http2 should be --http2 (double dash). This would silently fail or cause errors.

Fragile URL parsing

match &id_input.get(..5) {
    Some("https") => // ...
    Some("archi") => // ...
  • Checking only 5 characters is unreliable
  • Doesn't handle http:// (no 's')
  • Magic indices (.nth(4), .nth(2)) will break with different URL formats

Commented-out code

//.skip(1)

Remove this or document why it's there.

Confusing output format

_ => println!("{:?} {}", &lengths[1..], lengths[0]),

This prints chapters 2+ first, then chapter 1 separately. Confusing... Why?

This should be documented in a comment.

Raw string unnecessary

let url = format!(r"https://archiveofourown.org/works/{work_id}?view_full_work=true");

The r prefix is pointless here (no escape sequences to avoid).

Standard quotes would be more idiomatic unless you plan on adding complex regex or Windows file paths to the string later.

README.md and LICENSE.md

Readme and License are missing. It would help understanding the project and its rights to use.

regal pelican
#

Sorry if it wasn't, just some stuff about the phrasing feels iffy

toxic shoal
#

I did one manual and asked LLM as well

#

So 2 in one

#

I review LLM as well

#

I usually have my points and a subset of the LLM

regal pelican
#

Alright then

#

Addressing your points in order

  • that is a good point and I'll add some metadata
  • most of the crashes are intentional: if the user inputs an invalid ID there's no meaningful recovery that can be done; works with no chapters can't be uploaded so that always indicates a failure; if there's no network again nothing can be done. Malformed html could be checked to look for some errors but on the whole that just won't arise due to how well structured ao3 html is.
  • good spot, I didn't catch that incorrect flag
  • again good spot, I forgot some bowsers still use http. For the other points, even checking five characters is overkill tbh, since it's only designed to work with those three kinds of link - I don't think ao3 even has any other way of linking to a work. I could trim off view_full_work if it's present tho
  • the next two points go together, confusingly the first "chapter" is actually the full work which is why it's printed separately and why I was originally skipping it.
  • another good spot, that's an artefact from when I was creating a more complex formatted string
  • I should add a readme and license, you're right
toxic shoal
#

Chatbots can help writing README and License in a very fast way, of course manual editing is needed afterwards

#

I usually have the h1 and h2 titles and based on that, helps a lot

#

I always proof read afterwards