Machine Translation with Moses
Published on 02 Nov 2018
Moses is a PBMT (phrase-based machine translation) system developed at the University of Edinburgh. An example of its continuing relevance is that the unsupervised machine translation system [Monoses](https://github.com/artetxem/monoses) uses Moses for training. That said, compared to NMT, SMT is usually slower to train, its trained models require more storage space (they keep huge phrase tables), and it doesn't perform as well as the latest NMT models when large amounts of parallel data are available.

## Links to Tutorials

- Most of the following code snippets are taken from the [official Moses tutorial](http://www.statmt.org/moses/?n=Moses.Baseline).
- [A tutorial by Anoop](https://www.cse.iitb.ac.in/~anoopk/publications/presentations/moses_giza_intro.pdf)
- Also see this [MGIZA tutorial](https://fabioticconi.wordpress.com/2011/01/17/how-to-do-a-word-alignment-with-giza-or-mgiza-from-parallel-corpus/).

## Data Preprocessing

### Tokenization

```
~/lib/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ilci.en > ilci.tok.en
python ~/lib/indicnlp/indicnlp/tokenize/indic_tokenize.py ilci.hi ilci.tok.hi hi
```

### Truecasing or Lowercasing?

Let's stick with lowercasing.

```
tr '[:upper:]' '[:lower:]' < ilci.tok.en > ilci.norm.en
```

### Normalize Hindi

```
python ~/lib/indicnlp/indicnlp/normalize/indic_normalize.py ilci.tok.hi ilci.norm.hi hi
```

### Clean Sentences

Limit sentence length to 80:

```
~/lib/mosesdecoder/scripts/training/clean-corpus-n.perl ilci.norm hi en ilci.clean 1 80
```

## KenLM (Train a Language Model)

```
~/lib/kenlm/bin/lmplz -o 4 < ilci.clean.en > ilci.arpa.en
~/lib/kenlm/bin/build_binary ilci.arpa.en ilci.blm.en
```

### Querying LM

`echo "is this an English sentence ?" | ~/lib/kenlm/bin/query ilci.blm.en`

### Python Package for KenLM

```
import kenlm
model = kenlm.Model('lm/test.arpa')
print(model.score('this is a sentence .', bos=True, eos=True))
```

## Training

```
ROOT=/ssd_scratch/cvit/binu.jasim/moses/
MOSES=~/lib/mosesdecoder/
nohup nice $MOSES/scripts/training/train-model.perl \
    -root-dir $ROOT -corpus ilci.clean -f hi -e en \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:4:/home/binu.jasim/playground/may19/pbmt/ilci.blm.en:8 \
    -mgiza -mgiza-cpus 0 -external-bin-dir ~/lib/mgizapp/bin/ &> train.out &
```

### Fine Tuning (Optional)

We need a small parallel corpus for tuning. Don't forget to normalize and clean it as described above. We also `cd` to the root directory because the `mert-work` directory is created in the `pwd`.

```
cd $ROOT
LDIR=`pwd`
$MOSES/scripts/training/mert-moses.pl \
    $LDIR/dev.clean.hi $LDIR/dev.clean.en $MOSES/bin/moses $ROOT/model/moses.ini \
    --mertdir $MOSES/bin/ \
    --decoder-flags="-threads 8" &> mert.out
```

This takes a lot of time (~1 hour for a 500-sentence tuning set). If tuned, don't forget to point to `mert-work/moses.ini` while testing.

## Testing

```
TDIR=test-dir/iitb/hi-en/
~/lib/mosesdecoder/bin/moses -f $ROOT/model/moses.ini < $TDIR/test.norm.hi \
    > $TDIR/test.translated.en 2> $TDIR/test.out
```

Compute the test BLEU score:

```
~/lib/mosesdecoder/scripts/generic/multi-bleu.perl \
    -lc $TDIR/test.norm.en < $TDIR/test.translated.en
```

(See a test script [here](https://github.com/bnjasim/research/blob/master/2019/05_may/run_moses_test.sh).)

Testing can be sped up by filtering the phrase table down to only the entries needed for the test set; filtering reduced decoding time from 11 minutes to 2 minutes on the IITB test set. See the sketch below. Another guaranteed way to speed up inference is to binarize the phrase table, but that requires re-installation of Moses with CMPH.
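Here is a hedged sketch of that filtering step using Moses' `filter-model-given-input.pl` script. The `filtered-iitb` output directory name is just an example, and `$ROOT`/`$TDIR` are the variables defined above; adjust the paths to your setup.

```
# Keep only the phrase-table entries needed for this test set
# (the output directory name "filtered-iitb" is arbitrary).
~/lib/mosesdecoder/scripts/training/filter-model-given-input.pl \
    $ROOT/filtered-iitb $ROOT/model/moses.ini $TDIR/test.norm.hi

# If Moses was compiled with CMPH, the filtered table can also be
# binarized in the same step by appending:
#     -Binarizer ~/lib/mosesdecoder/bin/processPhraseTableMin

# Decode with the filtered moses.ini instead of the full model:
~/lib/mosesdecoder/bin/moses -f $ROOT/filtered-iitb/moses.ini \
    < $TDIR/test.norm.hi > $TDIR/test.translated.en 2> $TDIR/test.out
```

If the model was tuned, point the filter script at `mert-work/moses.ini` instead of `model/moses.ini`.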
## MGIZA

> Note: All of the following steps can be avoided. Just use the Moses pipeline up to step 6 (see the code snippet at the end); the individual MGIZA steps are taken care of automatically by Moses.

### Making Classes

```
~/lib/mgizapp/bin/mkcls -n10 -pilci.clean.en -Vilci.clean.en.vcb.classes
```

It was surprising to me that there shouldn't be any space after `-p`, `-V`, etc.

### Convert the Corpus into GIZA Format

```
~/lib/mgizapp/bin/plain2snt ilci.clean.hi ilci.clean.en
```

### Create Co-occurrence

```
~/lib/mgizapp/bin/snt2cooc ilci.clean.hi_ilci.clean.en.cooc \
    ilci.clean.hi.vcb ilci.clean.en.vcb \
    ilci.clean.hi_ilci.clean.en.snt
```

### Finally Aligning!

Use the configuration file [here](https://github.com/bnjasim/research/blob/master/pbmt/configfile) and run `~/lib/mgiza/bin/mgiza configfile` (note the *ncpus* setting in the configfile).

This final step took less than 5 minutes (for 50k parallel sentences) with 10 CPUs; the earlier steps are also fast. Finally, we have to concatenate all the `hi_en.dict.A3.final.part000`, `part001`, ... files; there will be as many parts as *ncpus*.

### Output

The final output looks something like this:

```
# Sentence pair (9) source length 4 target length 5 alignment score : 2.60427e-08
drink plenty of water .
NULL ({ }) खूब ({ 2 3 }) पानी ({ 4 }) पीएँ ({ 1 }) । ({ 5 })
```

### Not Done Yet!

If we want the word alignments, we have to continue up to step 6 of the [Moses pipeline](http://www.statmt.org/moses/?n=Moses.Baseline). Hence specify `-last-step 6` so that the reordering step (step 7) and everything after it, which we don't need, are skipped.

```
~/lib/mosesdecoder/scripts/training/train-model.perl \
    -root-dir /ssd_scratch/cvit/binu.jasim/moses/ -corpus ilci.clean \
    -f hi -e en -alignment grow-diag-final-and \
    -mgiza -mgiza-cpus 0 -external-bin-dir ~/lib/mgizapp/bin/ \
    -last-step 6 -score-options '--GoodTuring'
```

*(Note that `-mgiza-cpus 0` tells it to utilize all the available CPUs.)*
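As a small, hedged follow-up: once the run above finishes, the symmetrized word alignments normally end up under the model directory (as `aligned.grow-diag-final-and`) and can be inspected directly. Treat the exact paths as assumptions based on the root directory used above.

```
ROOT=/ssd_scratch/cvit/binu.jasim/moses/

# Each line holds token index pairs (e.g. "0-0 1-2 2-1") linking the two
# sides of the corresponding sentence pair in the cleaned corpus.
head -3 $ROOT/model/aligned.grow-diag-final-and

# View the alignments side by side with the sentence pairs themselves:
paste $ROOT/model/aligned.grow-diag-final-and ilci.clean.hi ilci.clean.en | head -3
```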