I've noticed recently that codebases tend to contain a lot of common misspellings. Understandably, programmers care more about the functionality (and hopefully cleanliness) of their code than whether they've spelled "occurred" correctly. However, I think it's important for code to be free of misspellings. Basic misspellings make code look unprofessional, particularly if that code is going to be read by customers or another outside party. To make it easier to spellcheck code quickly, I decided to write a Node.js script that would do it for me.

How It Works

The script has a saved list of words based on Wikipedia's list of commonly misspelled words, specifically the version that is intended for machines to use. The Wikipedia list has a lot of extra entries (proper nouns in particular) that I cut from the version my script uses, but it's pretty thorough overall.

Basically, the script takes in a list of files and, for each file, replaces every instance of each misspelled word with the correct spelling. There are a few limitations:

  • Only lowercase words will be replaced
    • It's tough to do a quick replace of Occured and occured at the same time without having both as separate entries in the dictionary
  • Only words in the middle of sentences will be replaced
    • It uses " word " as the pattern to avoid accidentally changing substrings of correctly spelled words, so a word at the start or end of a line, or next to punctuation, is skipped
  • Only ASCII files can be processed
    • It expects ASCII encoding, so non-ASCII characters will get wiped out
    • Non-English text (particularly Spanish, French, etc.) often contains words that would be misspellings in English but are correctly spelled in that language. Limiting it to ASCII avoids most non-English files.

In future versions I'd like to work on the first two, but I don't know if the third issue is really solvable.
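
As a rough sketch of the replacement rule described above: the dictionary entries and the function name fixSpelling below are illustrative only and are not taken from the actual script. It shows how matching on " word " leaves capitalized words and words next to punctuation untouched.

```javascript
// Hypothetical dictionary: misspelling -> correction (the real list is much longer).
const corrections = {
  occured: "occurred",
  recieve: "receive",
  teh: "the",
};

// Replace every " wrong " with " right ", matching only lowercase words
// that have a space on both sides.
function fixSpelling(text, dictionary) {
  let result = text;
  for (const [wrong, right] of Object.entries(dictionary)) {
    // split/join replaces all occurrences without needing a regex.
    result = result.split(` ${wrong} `).join(` ${right} `);
  }
  return result;
}

console.log(fixSpelling("it occured when teh file was read", corrections));
// prints "it occurred when the file was read"

// Capitalized words and words next to punctuation are left alone.
console.log(fixSpelling("Occured twice, then occured.", corrections));
// prints "Occured twice, then occured."
```

Supporting capitalized words with this approach would mean adding a second dictionary entry per word, which is the trade-off mentioned in the first limitation.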

One other issue/inefficiency that I would like to fix is that all of the file reads and writes are currently done synchronously. Obviously, this goes against the expected style of a Node.js script. However, when I did the reading asynchronously, files occasionally got wiped out rather than spellchecked, and I didn't see any performance boost that would make it worthwhile. It's an issue I'd like to work on in the next version.
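
For illustration, a synchronous read-fix-write loop might look roughly like the following. This is a minimal sketch, not the actual script: the function name spellcheckFile and the two-entry dictionary are assumptions made for the example.

```javascript
const fs = require("fs");

// Hypothetical dictionary; the real script loads a much longer saved list.
const dictionary = { occured: "occurred", teh: "the" };

// Read one file, fix the " word " matches, and write it back only if it changed.
// Returns true when the file was modified.
function spellcheckFile(path) {
  const original = fs.readFileSync(path, "ascii");
  let fixed = original;
  for (const [wrong, right] of Object.entries(dictionary)) {
    fixed = fixed.split(` ${wrong} `).join(` ${right} `);
  }
  if (fixed !== original) {
    fs.writeFileSync(path, fixed, "ascii");
  }
  return fixed !== original;
}

// File paths arrive as command-line arguments (for example from xargs).
for (const path of process.argv.slice(2)) {
  if (spellcheckFile(path)) {
    console.log(`fixed ${path}`);
  }
}
```

Because each file is fully read and written before the next one starts, a write can never overlap another read or write, which is the property the synchronous version gives for free.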

How To Run It

I use the following command to run the script:

file -I **/* | awk '/us-ascii/ {print substr($1, 1, length($1)-1)}' | xargs node spellcheck

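First, file outputs information (including the encoding) about each file matched by **/*, which is every file in the directory as well as in all subdirectories. This relies on the shell expanding ** recursively, as zsh does by default.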

Next, awk filters out all lines that don't contain 'us-ascii' and then pulls the first column (the file path and name) and trims off the last character.

Finally, those file paths are piped into node using xargs.

One limitation of this is that xargs treats a space as a delimiter, so it breaks on file paths/names with spaces in them. I've tried using the -0 flag on xargs, which makes it treat only \0 characters as delimiters, but I haven't been able to get awk to output the file names with \0 delimiters so that xargs parses them correctly.
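
One possible way around the delimiter problem, which is not what the script does today, would be to find the files inside Node so that paths never pass through awk or xargs. The sketch below is only an assumption about how that could look: looksAscii and findAsciiFiles are hypothetical names, and the byte check is a rough stand-in for what file -I reports rather than an equivalent.

```javascript
const fs = require("fs");
const path = require("path");

// Rough stand-in for `file -I`: treat a file as ASCII text when every byte
// is below 0x80 and none is NUL. This is an assumption, not what `file` does.
function looksAscii(buffer) {
  return buffer.every((byte) => byte !== 0 && byte < 0x80);
}

// Recursively collect the paths of ASCII-looking regular files under `dir`.
function findAsciiFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    // Skip dotfiles and dot-directories (such as .git), as a default glob would.
    if (entry.name.startsWith(".")) continue;
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findAsciiFiles(full));
    } else if (entry.isFile() && looksAscii(fs.readFileSync(full))) {
      found.push(full);
    }
  }
  return found;
}

// Example (spellcheckFile is a hypothetical per-file function):
// findAsciiFiles(".").forEach((file) => spellcheckFile(file));
```

Since the paths stay inside one process as ordinary strings, spaces in file or directory names need no special handling. The cost is that every file is read once just to check its bytes.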

Example Usage

I started testing this with Neovim since:

  1. It's a fairly large codebase (~330k LOC, 1442 files)
  2. The code is very old
  3. The code is very messy

My guess was that all of these things would contribute to a lot of misspellings that have been floating around for years.

I ran an early version of my spellchecker on Neovim, which resulted in pull request #827. That version of the script didn't automatically fix the files. Instead, it output the misspellings for manual correction. That was unpleasant, but it found a lot of misspellings.

The current version of Neovim still has a few misspellings. I checked out that version and ran the script as outlined above, which produced the following results:

  • runtime/autoload/ada.vim - actualy to actually
  • runtime/autoload/sqlcomplete.vim - foward to forward
  • runtime/autoload/sqlcomplete.vim - preceeded to preceded
  • runtime/autoload/syntaxcomplete.vim - begining to beginning
  • runtime/compiler/checkstyle.vim - preceeded to preceded
  • runtime/doc/os_win32.txt - noticable to noticeable
  • runtime/doc/tips.txt - teh to the
  • runtime/doc/usr_41.txt - otehr to other
  • runtime/doc/usr_41.txt - teh to the
  • runtime/doc/usr_41.txt - wnat to want
  • runtime/indent/ada.vim - preceeding to preceding
  • runtime/syntax/asm68k.vim - existance to existence
  • runtime/syntax/aspvbs.vim - helpfull to helpful
  • runtime/syntax/aspvbs.vim - recived to received
  • runtime/syntax/chill.vim - peristent to persistent
  • runtime/syntax/clipper.vim - beggining to beginning
  • runtime/syntax/d.vim - statment to statement
  • runtime/syntax/dtrace.vim - divison to division
  • runtime/syntax/ia64.vim - orignally to originally
  • runtime/syntax/kix.vim - preperation to preparation
  • runtime/syntax/mel.vim - usefull to useful
  • runtime/syntax/php.vim - begining to beginning
  • runtime/syntax/postscr.vim - compatability to compatibility
  • runtime/syntax/redif.vim - occurence to occurrence
  • runtime/syntax/samba.vim - prefered to preferred
  • runtime/syntax/sather.vim - developped to developed
  • runtime/syntax/specman.vim - succeded to succeeded
  • runtime/syntax/spup.vim - beginnig to beginning
  • runtime/syntax/tcl.vim - propogate to propagate

That whole process took about 66 seconds on a 2009 MacBook Pro. Since there are about 1,700 ASCII files in Neovim, that means we processed about 26 files per second, which I think is pretty good.

Now, some of those were false positives (examples of typo correction in documentation files for the most part), but a quick git diff caught those.

The results of my tests are included in pull request #1111.

Conclusion

I think my tool is close to being ready for real usage. Once I figure out the issue with spaces in file paths, it should be workable for many projects on GitHub, as well as many of my own.