Tokenization
Published on 09 Sep 2018
Tokenization is a standard preprocessing step in NLP, where a sentence is split into its constituent tokens such as words, numbers, and punctuation. Tools such as Moses and spaCy can do it, but they seem to support only a handful of (mostly European) languages. (For languages such as Hindi, there appears to be some code on GitHub for this. Need to look it up later.)

### Method 1: Regular Expressions

An easy way to perform tokenization is to write a few simple regular expressions:

```python
import re

def normalize_string(s):
    # s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([,.!?])", r" \1 ", s)     # put spaces around punctuation
    s = re.sub(r"[^a-zA-Z,.!?]+", r" ", s)  # replace everything else with a space
    s = re.sub(r"\s+", r" ", s).strip()     # collapse repeated whitespace
    return s
```

`unicode_to_ascii` is a useful preprocessing step for languages such as French. It converts accented letters to plain ASCII letters (where possible):

```python
import unicodedata

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```

This tokenization fails in many situations: e.g. `U.K.` will be split into four tokens `U . K .`, which is not what we want. More advanced tokenizers handle most of these corner cases.

### Method 2: Moses Tokenizer

The `tokenizer.perl` script provided by Moses SMT (Koehn et al.) is a standard tool for tokenization:

```bash
OpenNMT-py/tools/tokenizer.perl -a -no-escape -l fr -q < input.txt > output.atok
```

### Method 3: spaCy

```python
import spacy

text = u'Apple is looking at buying U.K. startup for $1 billion'
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
# spacy.load('en', disable=['parser', 'tagger', 'ner']) also works
doc = nlp(text)
print(' '.join([token.text for token in doc]))
```

Output: `Apple is looking at buying U.K. startup for $ 1 billion`

### Remarks

- spaCy tokenizes `Tom's` as `Tom 's`, while Moses tokenizes it as `Tom' s`. I think I would prefer spaCy.
- BPE (byte pair encoding) probably can't be applied without tokenization, since tokenization is needed to separate punctuation. BPE/sentencepiece may or may not do this.
- To use more languages, the corresponding models need to be installed: `spacy download fr`. Then run `spacy.load('fr_core_news_sm')` or simply `spacy.load('fr', disable=['parser', 'tagger', 'ner'])`.
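To make the BPE remark concrete, here is a minimal sketch of the BPE merge loop from Sennrich et al. (2016); the function names and toy vocabulary here are my own illustration, not any library's API. Merges never cross whitespace, so any punctuation left glued to a word would end up inside the learned subword units, which is why splitting punctuation off first matters:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

# Toy word-frequency vocabulary (the example from the BPE paper); each word
# is a space-separated symbol sequence ending in the end-of-word marker </w>.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

merges = []
for _ in range(5):
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # learned merge operations, most frequent first
```

(sentencepiece sidesteps the pre-tokenization requirement by treating the input as a raw character stream and encoding whitespace as an explicit symbol.)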