January 14, 2015
I’ve noticed recently that codebases tend to have a bunch of common misspellings. Obviously, programmers care more about the functionality (and hopefully cleanliness) of their code than whether or not they’ve spelled “occurred” correctly. However, I think it’s important that code is free of misspellings. Basic misspellings make the code look unprofessional, particularly if that code is going to be read by customers or another outside party. To make it easier to spellcheck code quickly, I decided to write a Node.js script that would do it for me.
The script has a saved list of words based on Wikipedia’s list of commonly misspelled words, in the version that’s actually intended for machines to use. The Wikipedia list has a lot of extra words (proper nouns in particular) that I cut out of the list my script uses, but it’s pretty thorough overall.
Basically, the script takes in a list of files and, for each file, replaces all instances of each misspelled word with the correct spelling. There are a few limitations on that:
In future versions I’d like to work on the first three, but I don’t know if the third issue is really solvable.
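The replacement step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `corrections` entries and the `fixSpelling` name are made up for the example, and the real word list is far longer.

```javascript
// Hypothetical word list: misspelling -> correction. These three entries are
// illustrative only; they are plain letters, so no regex escaping is needed.
const corrections = {
  occured: 'occurred',
  recieve: 'receive',
  seperate: 'separate',
};

// One regex that matches any listed misspelling as a whole word.
const pattern = new RegExp('\\b(' + Object.keys(corrections).join('|') + ')\\b', 'g');

// Return the text with every listed misspelling replaced.
function fixSpelling(text) {
  return text.replace(pattern, (word) => corrections[word]);
}

console.log(fixSpelling('An error occured; recieve() ran in a seperate thread.'));
// → An error occurred; receive() ran in a separate thread.
```

Note that this sketch is case-sensitive and only matches whole words, so it leaves both `Occured` and `recieve_data` alone (the underscore counts as a word character).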
One other issue/inefficiency that I would like to fix is that currently all of the file reads and writes are done synchronously. Obviously, this goes against the expected style of a Node.js script. However, I had trouble with files occasionally getting wiped out rather than spellchecked when I did the reading asynchronously, and I didn’t see any boost in performance to make that worthwhile. It’s an issue I’d like to work on in the next version.
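For reference, here is one way an asynchronous version might look. It is only a sketch, assuming a current Node.js release where `fs.promises` is available; `spellcheckFiles` and the one-word `fixSpelling` stand-in are hypothetical names. It awaits each read and write before touching the next file, and I haven't verified whether that avoids the wiped-file problem, since I don't know what caused it.

```javascript
const fs = require('fs').promises;

// Hypothetical stand-in for the script's real replacement step.
function fixSpelling(text) {
  return text.replace(/\boccured\b/g, 'occurred');
}

// Process files one at a time, awaiting each read and write before moving on.
async function spellcheckFiles(paths) {
  for (const path of paths) {
    const original = await fs.readFile(path, 'utf8');
    const fixed = fixSpelling(original);
    // Skip the write entirely when nothing changed.
    if (fixed !== original) {
      await fs.writeFile(path, fixed, 'utf8');
    }
  }
}
```

Calling `spellcheckFiles(process.argv.slice(2))` would run it over the paths given on the command line. Because the files are handled sequentially, this probably wouldn't be any faster than the synchronous version.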
I use the following command to run the script:
file -I **/* | awk '/us-ascii/ {print substr($1, 1, length($1)-1)}' | xargs node spellcheck
First, file outputs information (including the encoding) about each file in the directory as well as all subdirectories.
Next, awk filters out all lines that don’t contain ‘us-ascii’ and then pulls the first column (the file path and name) and trims off the last character.
Finally, those file paths are piped into node using xargs.
One limitation of this is that xargs will treat a space as a delimiter, so it breaks on file paths/names with spaces in them. I’ve tried using the -0 flag on xargs, which makes it only treat \0 characters as delimiters, but I haven’t been able to get awk to output the file names with the right delimiters for xargs to parse them correctly.
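One possible way around this would be to skip awk and xargs and let Node parse the output of file directly, since splitting on lines instead of whitespace keeps spaces in paths intact. The sketch below is an untested idea rather than something the script does today: `asciiPaths` is a hypothetical helper, and it assumes each line of output looks like "path: mime type; charset=encoding".

```javascript
// Extract the paths of us-ascii files from `file -I` output.
// Assumes one entry per line, shaped like "<path>: <mime type>; charset=<encoding>".
function asciiPaths(fileOutput) {
  const paths = [];
  for (const line of fileOutput.split('\n')) {
    // The last ": " separates the path from the description.
    const sep = line.lastIndexOf(': ');
    if (sep > 0 && line.includes('charset=us-ascii', sep)) {
      paths.push(line.slice(0, sep));
    }
  }
  return paths;
}

// Made-up sample output for illustration.
const sample = [
  'src/main.c: text/x-c; charset=us-ascii',
  'docs/my notes.txt: text/plain; charset=us-ascii',
  'logo.png: image/png; charset=binary',
].join('\n');

console.log(asciiPaths(sample));
// → [ 'src/main.c', 'docs/my notes.txt' ]
```

With something like this, the script would read the output of file from standard input instead of taking paths as arguments. It would still break on file names that contain a newline.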
I initially started testing this with Neovim since:
My guess was that all of these things would contribute to a lot of misspellings that have been floating around for years.
I ran an early version of my spellchecker on Neovim, which resulted in pull request #827. That version of the script didn’t automatically replace the files, instead opting to output the misspellings for manual correction. That was unpleasant, but it found a lot of misspellings.
In the current version of Neovim, there are a few misspellings. I checked out that commit and ran the script as outlined above, which produced the following results:
That whole process took about 66 seconds on a 2009 MacBook Pro. Since there are about 1,700 ASCII files in Neovim, that means we processed about 26 files per second, which I think is pretty good.
Now, some of those were false positives (examples of typo correction in documentation files for the most part), but a quick git diff caught those.
The results of my tests are included in pull request #1111.
I think my tool is close to being ready for real usage. Once I figure out the issue with spaces in file paths, it should be workable for many projects on GitHub, as well as many of my own.