Cleaning up manageiq.org typos using ruby and open source


#1

I noticed a typo here and there on the website and decided to see if I could detect others with Ruby.

The result is here

I used a deprecated Ruby gem, raspell, which provides bindings to Aspell. Apparently, ffi-aspell is the one to use now, but I had no issues with raspell.

To run, I had to install aspell:

For OSX:
brew install aspell

Then, install the gem:
gem install raspell

I then search all of the website files whose content is shown on the website or on github.com: haml, yml, yaml, and md.
The code is pretty simple.
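
To try the file search on its own, here is a minimal sketch that runs the same kind of brace glob against a throwaway directory. The `content_files` helper and the sample file names are hypothetical, made up for illustration; only the extension list comes from the post.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: list files under root that have one of the extensions.
def content_files(root, extensions = %w[haml yml yaml md])
  Dir.glob(File.join(root, '**', "*.{#{extensions.join(',')}}")).sort
end

Dir.mktmpdir do |root|
  # Build a tiny fake site: two content files and one stylesheet
  FileUtils.mkdir_p(File.join(root, 'docs'))
  %w[index.haml docs/guide.md site.css].each do |name|
    File.write(File.join(root, name), '')
  end

  # Show the matches relative to the temporary root
  puts content_files(root).map { |path| path.sub("#{root}/", '') }
end
# prints docs/guide.md, then index.haml; site.css is skipped
```

The `**` part makes the glob recurse into subdirectories, and the braces list the alternatives for the extension.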

spellchecker.rb:

#!/usr/bin/env ruby

require 'rubygems'
require 'raspell'
speller = Aspell.new('en_US')
speller.set_option("ignore-case", "true")
speller.set_option("sug-mode", "slow")

# Ignore words of length 3 or less
speller.set_option("ignore", "3")

Dir.glob("/path/manageiq.org/**/*.{haml,yml,yaml,md}").each do |file|
  # Split the file contents into whitespace-separated words
  words = File.read(file).split

  # Print each misspelled word in lowercase so duplicates can be counted later
  speller.list_misspelled(words).each do |mistake|
    puts mistake.downcase
  end
end

This produces output like:

allowfullscreen
rhev
rhevm
smartstate
manageiq
smartproxy
cloudforms
datastore
iscsi
datacenter

I then run the code like this:
ruby spellchecker.rb | sort | uniq -c | sort -n

sort: sort by word so that duplicates end up next to each other
uniq -c: collapse repeated words and prefix each with its count
sort -n: sort the result of uniq numerically by count

The output looks like this:

   1 vmfs
   1 workgroup
   2 bundler
   2 charset
   2 chromeframe
   ...
  10 href
  11 github
  11 openstack
  38 manageiq

Most typos occur fewer than 3 times, so I concentrate on the top of the list, where the low counts are.
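
For anyone without the shell tools handy, the same counting can be approximated in Ruby. This is only a sketch and not the pipeline used above: `count_words` is a hypothetical helper, the sample words are illustrative, and `Enumerable#tally` needs Ruby 2.7 or newer.

```ruby
# Count how often each word appears, then order by ascending count,
# breaking ties alphabetically.
def count_words(words)
  words.tally.sort_by { |word, count| [count, word] }
end

# Sample data standing in for the script's output (illustrative only)
mistakes = %w[manageiq rhev manageiq smartstate rhev manageiq]

count_words(mistakes).each do |word, count|
  puts format('%4d %s', count, word)
end
# prints:
#    1 smartstate
#    2 rhev
#    3 manageiq
```

The rare words still come first, so likely typos stay at the top of the list.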

It’s pretty easy to then find the typos by doing:

git grep -i typo_from_above

#2

Really cool! Thanks @jrafanie


#3

Also fixed up the guides repo with this technique. Unfortunately, many of the mistakes it finds are false positives. Some are tech terms it doesn’t know about, like refactor or ovirt.


#4

Cool stuff! Does the ability to supply a dictionary of allowable terms exist?


#5

@jerrykbiker Yes, aspell doesn’t seem that great at providing suggestions, but it does offer a dictionary option. I haven’t actually used it with a custom dictionary, but raspell exposes an interface for passing options to aspell. I use that same interface to set the “ignore-case” and “sug-mode” options above.

speller.set_option("ignore-case", "true")
speller.set_option("sug-mode", "slow")
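
Until the dictionary option has been tried, one workaround that doesn't touch aspell at all is to filter known terms out in Ruby before printing. A minimal sketch, assuming a hand-maintained allowlist: `ALLOWED_TERMS` and `drop_allowed` are hypothetical names, and this is not aspell's own dictionary mechanism.

```ruby
require 'set'

# Hypothetical allowlist of project and tech terms that are not typos
ALLOWED_TERMS = Set.new(%w[manageiq ovirt refactor smartstate cloudforms])

# Remove any reported word that appears in the allowlist, ignoring case.
def drop_allowed(mistakes, allowed = ALLOWED_TERMS)
  mistakes.reject { |word| allowed.include?(word.downcase) }
end

p drop_allowed(%w[ManageIQ recieve ovirt seperate])
# prints ["recieve", "seperate"]
```

In spellchecker.rb this would wrap the result of `speller.list_misspelled(words)` before the `puts`.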

Note that the Ruby binding, raspell, is deprecated in favor of ffi-aspell.