#Add automated spell checking to Tokio do...

faint delta

I extracted the questions from the Markdown document I wrote here for quick answering:

    0) Q) What should the branch name be?
       A)

    1) Technical Term Spellings:
        Q) For technical terms that may have multiple accepted spellings,
           should we include all variations in the dictionary, or should we consult the community/someone to
           decide on a preferred spelling?
        A)

    2) Script Inclusion in the PR:
        Q) Should the script used for automating the spell checker process be included in the pull request?
           Moreover, is there a preference for the language used (e.g., Python, Bash, Rust)?
           For example: update_dictionary.sh
        A)

    3) Location for the Automation Script:
        Q) If 2) is included, what would be the ideal location for this script within the project structure?
        A)

    4) Iterative Refinement Process:
        Q) How should we handle the iterative refinement of the spell checker dictionary?
           Specifically, do you have any suggestions or preferences for the process of reviewing and updating the dictionary
           based on flagged words from new or changed documentation files?
           For example, a PR introduces a Tokio-specific word that doesn't exist in the dictionary.
           Should the developer add it to the dictionary as part of their PR? Or should it get picked up automatically in
           some process later on?
        A)

    5) Handling Tokio-specific Terms:
        Q) In cases where a term is specific to Tokio and not part of the standard English lexicon,
           what should be our approach for verifying and adding these terms to the .spellcheck.dic file?
        A)

    6) Dealing with Non-Existent or Incorrect Words:
        Q) When we encounter words that appear to be made-up or incorrect, what should be the protocol
           for reaching out to the original author for clarification?
        A)

Full markdown file is linked below

half oasis

@faint delta That's a comprehensive set of questions. I will try to answer them.

1) Technical Term Spellings:
    Q) For technical terms that may have multiple accepted spellings,
       should we include all variations in the dictionary, or should we consult the community to
       decide on a preferred spelling?
    A) Just pick something. You may add several spellings, or you may pick one and change all other uses to match. As
       long as it's reasonable, it doesn't matter.

2) Script Inclusion in the PR:
    Q) Should the script used for automating the spell checker process be included in the pull request?
    A) This is not a requirement for solving this issue, but a nice-to-have. I would accept a PR without it, but if you
       have a script that works well, then it would be nice to include it.

3) Location for the Automation Script:
    Q) If 2) is included, what would be the ideal location for this script within the project structure?
    A)  You can create a new folder at the top of the repository for it. (In that case, I would move
       `.spellcheck.toml` and `.spellcheck.dic` into the same folder and remove the leading periods.)

4) Iterative Refinement Process:
    Q) How should we handle the iterative refinement of the spell checker dictionary?
       Do you have any specific suggestions or preferences for the process of reviewing and updating the dictionary
       based on flagged words from new or changed documentation files?
       For example, a PR introduces a Tokio-specific word that doesn't exist in the dictionary.
       Should the developer add it to the dictionary as part of their PR? Or should it get picked up automatically in
       a CI process?
    A) Their PR would fail the spellcheck CI step, and they would need to update the PR to add it to the dictionary to
       get their PR merged.

5) Handling Tokio-specific Terms:
    Q) In cases where a term is specific to Tokio and not part of the standard English lexicon,
       what should be our approach for verifying and adding these terms to the .spellcheck.dic file?
    A) If they are used as normal words, then just add them to the dictionary.

       However, if the word is really the name of a type (e.g. "TcpStream"), then the documentation should be updated to
       wrap the type name in backticks: `TcpStream`. The `cargo spellcheck` tool will ignore such words, since type names
       generally are not in the dictionary. (Exceptions can be made for type names used very commonly without
       backticks.)

6) Dealing with Non-Existent or Incorrect Words:
    Q) When we encounter words that appear to be made-up or incorrect, what should be the protocol
       for reaching out to the original author for clarification?
    A) Just rewrite the text to avoid the word. If you don't know what is meant, you can ask the user, or you can just
       ask on the Discord. It does not have to be the original author that answers.

7) Guidance on Scripting Language Preference:
    Q) Do you have a preference for the scripting language to be used for automating the spell checking process
       (e.g., Python, Shell script)?
    A) Python and shell scripts both seem reasonable.

bright elbow

Hi!
I've been working on this issue for a day and a half, so I thought I could share what I've found and help a little bit.

Technical Term Spellings
There will be instances of these terms throughout the code base. See waker, for example, in the partial work provided by @half oasis. My approach was to search the web to find out whether a particular term is acceptable, even though it may not be strictly correct. For example, version is used as a verb in versioned. This will be flagged by hunspell, but it can be acceptable in the realm of software development. It depends a little on your judgment. My judgment tells me that PRs in Tokio are thoroughly reviewed, so terms that are not obviously a mistake most probably aren't.

Script Inclusion in the PR
Location for the Automation Script
I think just adding a CI step alongside an updated CONTRIBUTING.md that explains how to do the spellchecking locally should be enough.

Iterative Refinement Process
The .dic file, if changed, will be shown in the diff of the PR, and can be reviewed like the rest of the code. So I think whoever opens the PR should be responsible for updating the .dic file.

Handling Tokio-specific Terms
If a term is present in the public API, I believe it should not be flagged. There are words like Spawner that will be flagged that are correct terms in the context of tokio, so like technical terms, I suggest relying on your judgement mostly.

Feel free to message me if you need help with anything regarding this issue : )

P.S.: These are only my thoughts, so what @half oasis provided as answers are the guidelines you should follow

faint delta

Thank you guys for responding! I'm a newbie so I appreciate any help that I can get.
@bright elbow Is it okay if I message you regarding the CI step? I will definitely need help when I get there.

faint delta

I made some progress last night on this feature.
I was able to parse out all of the comments (lines starting with /// and //!) inside of the .rs files.
I did this in Python because it was a bit hard for me to use grep and ignore the cases where I was inside code blocks (which don't start with text).
I then cleaned up the output with some basic substitutions. After that, I split the sentences into single words and appended those words to a Python list.
I also added all of the words inside of the markdown files.

I then took the Python list containing all the words and wrote each word on a new line to a local file, so I could run them against hunspell to generate a personal dictionary.
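A minimal sketch of the extraction steps described above, not the actual script from this thread. The function names (`doc_comment_lines`, `split_words`, `collect_words`) are hypothetical, and the cleanup rule (replace every non-letter run with a space) is an assumption rather than the substitutions that were really used:

```python
import re
from pathlib import Path

DOC_PREFIXES = ("///", "//!")
FENCE = "`" * 3  # marker that opens/closes a code block inside doc comments


def doc_comment_lines(source: str) -> list[str]:
    """Return doc-comment text, skipping fenced code blocks inside the comments."""
    lines = []
    in_fence = False
    for raw in source.splitlines():
        stripped = raw.strip()
        if not stripped.startswith(DOC_PREFIXES):
            continue
        text = stripped[3:].strip()
        if text.startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if not in_fence:
            lines.append(text)
    return lines


def split_words(lines: list[str]) -> list[str]:
    """Split comment lines into a sorted list of unique candidate words."""
    found = set()
    for line in lines:
        # Replace non-letters with a space (instead of deleting them) so that
        # neighbouring tokens stay separate.
        cleaned = re.sub(r"[^A-Za-z']+", " ", line)
        found.update(w.strip("'") for w in cleaned.split())
    found.discard("")
    return sorted(found)


def collect_words(root: str) -> list[str]:
    """Gather candidate words from every .rs file below `root`."""
    lines = []
    for path in Path(root).rglob("*.rs"):
        lines.extend(doc_comment_lines(path.read_text(encoding="utf-8")))
    return split_words(lines)
```

Writing the result one word per line (for example with `Path("words.txt").write_text("\n".join(collect_words(".")))`, where `words.txt` is a placeholder name) gives a file that can be fed to hunspell. This simple prefix check also matches lines starting with four slashes, which Rust does not treat as doc comments, so a real script may want a stricter test.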

There are some cases where the processing I did with naive string substitution wasn't really very good. For example, the grouping of words that have async is kinda cringe.

async
async_
async_stream
async await
async bufread
asyncbufreadext
asyncfd
asyncfdreadyguard
asynchronous
asynchronously
asyncread
asyncreadasyncwrite
asyncreadext
asyncseek
asyncseekext

In part, this makes me want to re-evaluate my approach. Perhaps passing the sentences through nltk or spacy to retrieve tokens/words would be a bit more robust... but I am questioning whether that is me making things complicated or if it's actually a good idea.

Regardless, I'll need to clean this up before I start running the words against hunspell to generate the personal dictionary.

half oasis

I mean, asyncbufreadext sounds like AsyncBufReadExt being made lowercase. And that's a type name that should be wrapped in backticks rather than added to the dictionary.
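A rough sketch of how type-like names outside backticks could be found and wrapped. The CamelCase pattern is only a heuristic assumed for illustration (it misses names such as CtrlC and would also match ordinary words like JavaScript), and `wrap_type_names` is a hypothetical helper, so every hit would still need manual review:

```python
import re

# Heuristic: two or more capitalized "humps", e.g. TcpStream, AsyncBufReadExt.
TYPE_LIKE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")
# Inline code spans delimited by single backticks.
CODE_SPAN = re.compile(r"`[^`]*`")


def _wrap(text: str) -> str:
    """Wrap every type-like name in `text` in backticks."""
    return TYPE_LIKE.sub(lambda m: "`" + m.group(0) + "`", text)


def wrap_type_names(line: str) -> str:
    """Wrap type-like names, leaving existing code spans untouched."""
    parts = []
    pos = 0
    for span in CODE_SPAN.finditer(line):
        parts.append(_wrap(line[pos:span.start()]))  # prose before the span
        parts.append(span.group(0))                  # the span itself, unchanged
        pos = span.end()
    parts.append(_wrap(line[pos:]))                  # prose after the last span
    return "".join(parts)


print(wrap_type_names("Wraps a TcpStream in an `AsyncRead` adapter."))
```

Because already-wrapped names are skipped, running the helper twice on the same line gives the same result as running it once.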

faint delta

Picking this back up over the weekend~ work was crazy last week.
I ended up extending the filter method to use the vanilla spaCy tokenizer. The results immediately improved a lot, and it didn't introduce too much code complexity, but it almost feels like using set theory to perform basic arithmetic, like the solution is too strong for the problem.
For comparison, 50 docs:

A
API
All
An
Another
AsyncRead
AsyncWrite
Attempting
Basic
Be
Because
Box::pin
BroadcastStream
Click
Combine
Conversion
Converts
Creates
CtrlBreak
CtrlC

vs

A
API
All
An
Another
AsyncRead
AsyncReadAsyncWrite
AsyncWrite
Attempting
Basic
Be
Because
Box
Click
Combine
Conversion
Converts
Creates
CtrlBreak
CtrlC

faint delta

Gotcha.
So for 5A) I will go through the documentation, validate which words are types, and add the backticks manually.
My process will be as follows:

  • Use the filter script to process all the words.
  • If a word looks like a type, I'll use ripgrep to search the root dir.
  • After manually validating that it is a type that's in use, I'll accept-all in hunspell (but not insert) so it's no longer flagged in my dictionary construction process.
  • After that, I'll wrap the type in backticks wherever it's used in the docs.

half oasis

Sounds reasonable to me.

Let me know if it takes too long to do it that way. We can find another solution in that case.

faint delta

Alright, I'll let you know, but I expect it will be okay. Thank you!

faint delta

Still making progress: hunspell lets you specify multiple dictionaries, so I experimented with moving some of the syscalls into their own dictionary, but I can see an argument for wrapping these syscalls in backticks instead.

More cleanup will definitely be necessary, but I'm continuously working through the directories to check for any words I've missed. So far I've run through tokio-util and tokio-test.

Oh wow, it's interesting that there's a stray character in front of ± in the markdown version of the dic. Maybe that's why this rule wasn't matching.

I will definitely have to refine the dictionary at some point to properly use the flags, but that's a battle for another day. It's almost 4am haha fml

faint delta

@half oasis This PR is almost ready for review.
The automation component was a bit ambitious, and I didn't really know how cargo-spellcheck / hunspell worked at first, so I won't include any of that and will just include the dictionary and configuration files.
Currently, the personal dictionary has grown to about 340 lines (a mixture of words and word stems with flags).
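For readers unfamiliar with the format, here is a small sketch of reading a hunspell-style .dic file: the first line holds a word count, and each following line is a word or stem, optionally followed by `/` and affix flags. The sample entries and the `parse_dic` helper are hypothetical, and what a flag such as `S` means depends on the accompanying affix file:

```python
def parse_dic(text: str) -> tuple[int, dict[str, str]]:
    """Parse hunspell-style dictionary text into (declared count, {stem: flags})."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])  # the first line is the declared number of entries
    entries = {}
    for line in lines[1:]:
        # "waker/S" -> stem "waker", flags "S"; plain words get empty flags.
        stem, _, flags = line.partition("/")
        entries[stem] = flags
    return declared, entries


# Hypothetical sample, not taken from the actual dictionary in the PR.
SAMPLE = """\
3
codebase
versioned
waker/S
"""

declared, entries = parse_dic(SAMPLE)
print(declared == len(entries), entries)
```

This sketch does not handle words that contain an escaped slash, so it is only meant to show the overall shape of the file.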

I found out that cargo-spellcheck has a recursive flag, and I just need to go over each directory to make sure I've seen everything that's flagged.

I think I will be able to get everything wrapped up by tomorrow night

half oasis

awesome

bright elbow

@faint delta Hi there! I wanted to check in: did you open this PR? Just want to make sure so we can avoid any duplicate efforts. Thanks!

faint delta

Hey! I’m almost done

I’m doing some final clean up work

faint delta

@bright elbow
I'm actually trying to create the PR now but having some issues

bright elbow

What issues?

Also you probably want to communicate with the other contributor that is working on this issue.

faint delta

Yeah, I didn't realize there was someone else who worked on it D:
It seems like our work mostly duplicates each other's, although it seems I was a bit more specific and didn't include any types.

ERROR: Permission to tokio-rs/tokio.git denied to arvaer.
fatal: Could not read from remote repository.

bright elbow

Are you trying to push directly to https://github.com/tokio-rs/tokio?

faint delta

not https but yes

Should I make a fork?

bright elbow

Yes. Push your local branch to your fork, and then GitHub will prompt you to open a PR.

faint delta

Okay, will do that shortly! sorry about that

bright elbow

No worries at all

faint delta

Filling out the PR template now~
I think the main difference between my PR and my colleague's is that I've taken a bit more care to remove the types and use the hunspell flags, and have not added the CI component, whereas he has mostly included everything and added a CI component. I'll message him on the issue right after I put out the PR.

bright elbow

Thanks.

faint delta

I'm so sorry, I didn't realize I wasn't following the proper commit message guidelines. I'll make sure to do so from now on.

bright elbow

That's okay. The commit message guideline is for the final (squashed) commit.

faint delta

Oh I see!

faint delta

@bright elbow Hi Amin’! Apologies for the late message,
is there any way I can assist you with the PR?