How to Write a Spelling Corrector
This is a Java 5/6 version for Norvig’s spelling corrector.
The main objective was create the smallest Java version, and not focusing on performance.
In the Norvig’s article, you can find the explanation, and other implementations (your favorite language can be there!).
As you can count, this implementation has 42 lines (or 35, if you want to follow the Norvig counting and not include the blank lines). Not so beautiful as the Norvig’s python (as any statically typed language), not so huge as the C version.
In order to use, you must:
- download the dictionary file: big.txt (sorry Norvig, hotlink!)
- compile the code in the command line: javac Spelling.java
- exec in the command line: **java Spelling
** - example: java Spelling speling should output spelling
This implementation also uses only JRE default libraries, and don’t requires external or custom libraries.
If you want a better performance version, checkout the improved version on GitHub.
History:
2007/08/13 - Original version
2007/08/15 - Removed a useless method (remained in code from the draft version)
2007/08/16 - Changed the static initializer to a constructor, removed the temp buffer in the file loading
2007/08/16 - Removed similar methods
2007/08/17 - Replace the tailored max method for the java.util.Collections.max()
2007/09/03 - Additional notes added
2007/12/11 - Some tricks and more one line removed
2007/12/12 - Added a Groovy version
2008/12/07 - Some tricks and more one line removed (thanks to Anil Madamala, “Passionate about programming”)
2009/02/09 - Useless line removed (thanks to Sergey Mikhanov)
2019/11/28 - Adding the fast version on GitHub
import java.io.*;
import java.util.*;
import java.util.regex.*;
class Spelling {
private final HashMap<String, Integer> nWords = new HashMap<String, Integer>();
public Spelling(String file) throws IOException {
BufferedReader in = new BufferedReader(new FileReader(file));
Pattern p = Pattern.compile("\\w+");
for(String temp = ""; temp != null; temp = in.readLine()){
Matcher m = p.matcher(temp.toLowerCase());
while(m.find()) nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
}
in.close();
}
private final ArrayList<String> edits(String word) {
ArrayList<String> result = new ArrayList<String>();
for(int i=0; i < word.length(); ++i) result.add(word.substring(0, i) + word.substring(i+1));
for(int i=0; i < word.length()-1; ++i) result.add(word.substring(0, i) + word.substring(i+1, i+2) + word.substring(i, i+1) + word.substring(i+2));
for(int i=0; i < word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i+1));
for(int i=0; i <= word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));
return result;
}
public final String correct(String word) {
if(nWords.containsKey(word)) return word;
ArrayList<String> list = edits(word);
HashMap<Integer, String> candidates = new HashMap<Integer, String>();
for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
if(candidates.size() > 0) return candidates.get(Collections.max(candidates.keySet()));
for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
return candidates.size() > 0 ? candidates.get(Collections.max(candidates.keySet())) : word;
}
public static void main(String args[]) throws IOException {
if(args.length > 0) System.out.println((new Spelling("big.txt")).correct(args[0]));
}
}
Additional notes:
- In a real usage, you should not read the dictionary file (big.txt) all the times.
Groovy Version
This is a Groovy version and exists just for fun. It’s too slow!
As you can count, it has 25 lines (or 22 without the blank lines).
It requires at least the 1.5 Groovy version.
def train(f){
def n = [:]
new File(f).eachLine{it.toLowerCase().eachMatch(/\w+/){n[it]=n[it]?n[it]+1:1}}
n
}
def edits(word) {
def result = [], n = word.length()-1
for(i in 0..n) result.add(word[0..<i] + word.substring(i+1))
for(i in 0..n-1) result.add(word[0..<i] + word[i+1] + word[i, i+1] + word.substring(i+2))
for(i in 0..n) for(c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i+1))
for(i in 0..n) for(c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i))
result
}
def correct(word, nWords) {
if(nWords[word]) return word
def list = edits(word), candidates = [:]
for(s in list) if(nWords[s]) candidates[nWords[s]] = s
if(candidates.size() > 0) return candidates[candidates.keySet().max()]
for(s in list) for(w in edits(s)) if(nWords[w]) candidates[nWords[w]] = w
return candidates.size() > 0 ? candidates[candidates.keySet().max()] : word
}
println(correct(args[0], train("big.txt")))