Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download
Views: 45895
Kernel: Python 3

Spell-checking using vcsn

This notebook shows how commands from Vcsn can be used to build a spell-checker (actually a spello-fixer).

We first build a minimal automaton (on Nmin\mathbb{N}_\text{min}) to represent a (finite) language. To make it minimal, we built a codeterministic automaton, which we determinize.

import vcsn c = vcsn.context('lan_char, nmin') lang = c.cotrie(filename='/usr/share/dict/words', format='words').determinize().strip()

context.cotrie returns an automaton on letterset (not nullableset), which is a good idea for determinize. Since later we will want the labelset to be nullable, we convert lang to an automaton using context c. The resulting automaton is quite compact: compare the number of states with the number of words and the total number of characters.

lang = lang.automaton(c) lang.info()
{'is codeterministic': 'N/A', 'is complete': 'N/A', 'is deterministic': 'N/A', 'is empty': False, 'is eps-acyclic': True, 'is normalized': False, 'is proper': True, 'is standard': True, 'is trim': True, 'is useless': False, 'is valid': True, 'number of accessible states': 131164, 'number of coaccessible states': 131164, 'number of codeterministic states': 'N/A', 'number of deterministic states': 'N/A', 'number of final states': 15679, 'number of initial states': 1, 'number of lazy states': 0, 'number of spontaneous transitions': 0, 'number of states': 131164, 'number of transitions': 289227, 'number of useful states': 131164, 'type': 'mutable_automaton<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nmin>'}
# The number of words, and then the total number of letters. !wc -lc /usr/share/dict/words
235886 2493109 /usr/share/dict/words

Then we turn this automaton into a partial-identity transducer, i.e., each single-tape transition (snass \xrightarrow{\langle n \rangle a} s') is turned into a double-tape transition (snaass \xrightarrow{\langle n \rangle a|a} s'). This is because we want to compose these automata.

lang2 = lang.partial_identity() lang2.info()
{'is codeterministic': 'N/A', 'is complete': 'N/A', 'is deterministic': 'N/A', 'is empty': False, 'is eps-acyclic': True, 'is normalized': False, 'is proper': True, 'is standard': True, 'is trim': True, 'is useless': False, 'is valid': True, 'number of accessible states': 131164, 'number of coaccessible states': 131164, 'number of codeterministic states': 'N/A', 'number of deterministic states': 'N/A', 'number of final states': 15679, 'number of initial states': 1, 'number of lazy states': 0, 'number of spontaneous transitions': 0, 'number of states': 131164, 'number of transitions': 289227, 'number of useful states': 131164, 'type': 'mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>'}

Now we build the Levenhstein automaton for the same context.

lev = lang2.context().levenshtein() lev
Image in a Jupyter notebook

Compose the Levenshtein automaton to the language transducer. This results into a transducer which on the left-tape is accepting the language fuzzily, and on the second tape, features the exact words.

fixer = lev.compose(lang2).strip() fixer.info()
{'is codeterministic': 'N/A', 'is complete': 'N/A', 'is deterministic': 'N/A', 'is empty': False, 'is eps-acyclic': True, 'is normalized': False, 'is proper': True, 'is standard': False, 'is trim': True, 'is useless': False, 'is valid': True, 'number of accessible states': 131164, 'number of coaccessible states': 131164, 'number of codeterministic states': 'N/A', 'number of deterministic states': 'N/A', 'number of final states': 15679, 'number of initial states': 1, 'number of lazy states': 0, 'number of spontaneous transitions': 0, 'number of states': 131164, 'number of transitions': 22569950, 'number of useful states': 131164, 'type': 'mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>'}

For any candidate word, build the linear transducer that represents it.

w = c.expression('automathon').automaton().partial_identity() w
Image in a Jupyter notebook

Now compose the word and the fixer to get a computation of the Levenstheim distance of w to any word on the language.

fix = w.compose(fixer) fix.info()
{'is codeterministic': 'N/A', 'is complete': 'N/A', 'is deterministic': 'N/A', 'is empty': False, 'is eps-acyclic': True, 'is normalized': False, 'is proper': True, 'is standard': True, 'is trim': True, 'is useless': False, 'is valid': True, 'number of accessible states': 2754434, 'number of coaccessible states': 2754434, 'number of codeterministic states': 'N/A', 'number of deterministic states': 'N/A', 'number of final states': 31358, 'number of initial states': 1, 'number of lazy states': 0, 'number of spontaneous transitions': 0, 'number of states': 2754434, 'number of transitions': 14060199, 'number of useful states': 2754434, 'type': 'tuple_automaton<mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>, focus_automaton<1, mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>>, insplit_automaton<focus_automaton<0, mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>>>>'}

Display the proposed correction.

fix.lightest(1)

1automathonautomaton\left\langle 1\right\rangle \mathit{automathon}|\mathit{automaton}

Find the lightest path in this transducer: it maps our word to its closest word in the language.

fix.lightest_automaton()
Image in a Jupyter notebook

We can ask for several proposals of close words.

fix.lightest(10).project(1)

3automat3autobahn3automata3autophon3automatic2automatin1automaton3automelon3automotor2autochthon\left\langle 3\right\rangle \mathit{automat} \oplus \left\langle 3\right\rangle \mathit{autobahn} \oplus \left\langle 3\right\rangle \mathit{automata} \oplus \left\langle 3\right\rangle \mathit{autophon} \oplus \left\langle 3\right\rangle \mathit{automatic} \oplus \left\langle 2\right\rangle \mathit{automatin} \oplus \left\langle 1\right\rangle \mathit{automaton} \oplus \left\langle 3\right\rangle \mathit{automelon} \oplus \left\langle 3\right\rangle \mathit{automotor} \oplus \left\langle 2\right\rangle \mathit{autochthon}