Spell-checking using vcsn
This notebook shows how commands from Vcsn can be used to build a spell-checker (actually a spello-fixer).
We first build a minimal automaton (on ) to represent a (finite) language. To make it minimal, we built a codeterministic automaton, which we determinize.
context.cotrie returns an automaton on letterset (not nullableset), which is a good idea for determinize
. Since later we will want the labelset to be nullable, we convert lang
to an automaton using context c
. The resulting automaton is quite compact: compare the number of states with the number of words and the total number of characters.
235886 2493109 /usr/share/dict/words
Then we turn this automaton into a partial-identity transducer, i.e., each single-tape transition () is turned into a double-tape transition (). This is because we want to compose these automata.
Now we build the Levenhstein automaton for the same context.
Compose the Levenshtein automaton to the language transducer. This results into a transducer which on the left-tape is accepting the language fuzzily, and on the second tape, features the exact words.
For any candidate word, build the linear transducer that represents it.
Now compose the word and the fixer
to get a computation of the Levenstheim distance of w
to any word on the language.
Display the proposed correction.
Find the lightest path in this transducer: it maps our word to its closest word in the language.
We can ask for several proposals of close words.