Sms2fr

Translate text message abbreviations into proper French.

This program is based on "Rewriting the orthography of SMS messages" by François Yvon, Natural Language Engineering, volume 16, 2010. It uses two automata that were built with a machine learning method.

  • The first one proposes, for the syllables found in the text, the different spellings that sound alike (graphemic model).

  • The second one weights these alternatives by their probability given the syntax of the text (syntactic model).

Notes:

  • The following example does not use the actual weights and labels of the algorithm; its values were chosen as theoretical values for illustration.

  • The characters '#', '[' and ']' are treated as special characters by the trained automata and are not accepted in the input text.
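
Since the trained automata reserve these three characters, it can help to reject them before any automaton is built. A minimal sketch, assuming a hypothetical helper `check_sms` that is not part of Vcsn or of the original notebook:

```python
# Characters reserved by the trained automata (see the note above).
RESERVED = set('#[]')

def check_sms(sms):
    """Return the text unchanged, or raise ValueError on a reserved character.

    Hypothetical helper for illustration only; not part of Vcsn.
    """
    bad = sorted(RESERVED.intersection(sms))
    if bad:
        raise ValueError('reserved characters in input: ' + ' '.join(bad))
    return sms

print(check_sms('on svoit 2m1 ?'))
```

Raising early keeps the error message close to the user's input instead of surfacing later as an empty or unexpected result.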

The method

import vcsn
ctx = vcsn.context('lal_char, nmin')

Suppose we feed 'bo' to the sms2fr algorithm. The following automaton is created:

ctx.expression('bo').automaton()

It is then composed with the graphemic model (weights will depend on frequency of usage in the language):

ctx.expression('b(<1>o+<2>au+<2>eau)').standard()

And it is finally composed with the syntactic model (weights will depend on probability of presence of this grapheme after the letter b):

res = ctx.expression('b(<2>o+<3>au+<1>eau)').standard()
res

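The algorithm then chooses the path with the lightest weight. Weights live in a tropical weightset (Nmin in this toy example, Rmin in the actual algorithm), so the lightest path is the one whose weights add up to the smallest value, here 'beau'.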

res.lightest()

$\left\langle 1\right\rangle \mathit{beau}$
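
In a tropical weightset, weights along a path are added and competing paths are compared with min, so the lightest path is the one with the smallest sum. The sketch below replays the toy example in plain Python, without Vcsn. The way each weight is split over the transitions is an assumption made for illustration: unweighted transitions carry 0, the tropical unit.

```python
# Transition weights along each word accepted by b(<2>o+<3>au+<1>eau).
# Assumption: the weight sits on one transition, the others carry 0.
paths = {
    'bo':   [0, 2],
    'bau':  [0, 3, 0],
    'beau': [0, 1, 0, 0],
}

def path_weight(weights):
    # Tropical "multiplication" is ordinary addition.
    return sum(weights)

def lightest(paths):
    # Tropical "addition" is min: keep the word with the smallest total.
    word = min(paths, key=lambda w: path_weight(paths[w]))
    return path_weight(paths[word]), word

print(lightest(paths))
```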

Initialization

Unfortunately, for lack of a binary format for automata, loading these automata takes about ten seconds.

import re

graphs_dir = vcsn.config('configuration.datadir') + '/sms2fr/'
# Read the graphemic automaton.
grap = vcsn.automaton(filename=graphs_dir + 'graphemic.efsm')
# Read the syntactic automaton.
synt = vcsn.automaton(filename=graphs_dir + 'syntactic.efsm').partial_identity()

Core algorithm

def sms2fr(sms, k=1):
    # Graphemic automaton expects input of the format '[#this#is#my#text#]'.
    sms = re.escape('[#' + sms.replace(' ', '#') + '#]')
    ctx = vcsn.context('lan_char, rmin')
    # Translate the sms into an automaton.
    sms_aut = ctx.expression(sms).automaton().partial_identity().proper()
    # First composition with graphemic automaton.
    aut_g = sms_aut.compose(grap).coaccessible().strip()
    # Second composition with syntactic automaton.
    aut_s = aut_g.compose(synt).coaccessible().strip().project(1).proper()
    # Retrieve the most likely path to correspond to french translation.
    return aut_s.lightest(k)
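
The first step of `sms2fr` can be tried without Vcsn. The sketch below isolates it in a hypothetical helper named `frame`, then applies `re.escape` as the function does, presumably so that characters such as '[', ']' and '?' are read as plain letters rather than as expression operators.

```python
import re

def frame(sms):
    # Wrap the text in '[#' ... '#]' and use '#' as the word separator,
    # which is the format the graphemic automaton expects.
    return '[#' + sms.replace(' ', '#') + '#]'

framed = frame('on svoit 2m1 ?')
print(framed)             # the framed text
print(re.escape(framed))  # the same text with special characters backslash-escaped
```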

Examples

sms2fr('slt')

$\left\langle 24.5659\right\rangle \mathit{[\#salut\#]}$

sms2fr('bjr')

$\left\langle 30.8694\right\rangle \mathit{[\#bonjour\#]}$

sms2fr('on svoit 2m1 ?')

$\left\langle 190.602\right\rangle \mathit{[\#on\#se\#voit\#demain\#?\#]}$

sms2fr('t pa mal1 100 ton mento')

$\left\langle 201.495\right\rangle \mathit{[\#t'\#es\#pas\#malin\#sans\#ton\#manteau\#]}$
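
The labels returned by these calls still carry the '[#' ... '#]' framing. A small sketch of how one might turn such a label back into readable text, assuming the label is already available as a plain Python string (extracting it from the Vcsn result is not shown, and `unframe` is a hypothetical helper):

```python
def unframe(label):
    # Drop the '[#' prefix and the '#]' suffix, then restore the spaces.
    if not (label.startswith('[#') and label.endswith('#]')):
        raise ValueError('unexpected label format: ' + label)
    return label[2:-2].replace('#', ' ')

print(unframe('[#on#se#voit#demain#?#]'))
```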

It is also possible to get several candidate translations by using Vcsn's implementation of K-shortest-paths algorithms.

sms2fr('bjr mond', 3)

$\left\langle 96.7537\right\rangle \mathit{[\#bonjour\#mont\#]} \oplus \left\langle 65.9137\right\rangle \mathit{[\#bonjour\#monde\#]} \oplus \left\langle 97.0141\right\rangle \mathit{[\#bonjour\#mondes\#]}$

sms2fr('tante', 3)

$\left\langle 69.6353\right\rangle \mathit{[\#tant\#]} \oplus \left\langle 56.3297\right\rangle \mathit{[\#tante\#]} \oplus \left\langle 63.2217\right\rangle \mathit{[\#tantes\#]}$
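
The proposals are not displayed in order of weight. Assuming the weights and labels have been copied into a plain list of pairs (the values below come from the 'bjr mond' example), they can be ranked from most to least likely, the lightest weight being the best:

```python
# (weight, label) pairs copied from the 'bjr mond' example.
proposals = [
    (96.7537, '[#bonjour#mont#]'),
    (65.9137, '[#bonjour#monde#]'),
    (97.0141, '[#bonjour#mondes#]'),
]

def rank(proposals):
    # A smaller tropical weight means a more likely rewriting.
    return sorted(proposals, key=lambda p: p[0])

for weight, label in rank(proposals):
    print(weight, label)
```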