Published on 09 Sep 2018
Tokenization is a standard preprocessing step in NLP where a sentence is split into its constituent tokens, such as words and numbers. Tools such as Moses and spaCy can do it, but they seem to support only a few European languages. (For languages such as Hindi, there appears to be some code on GitHub for this; need to look it up later.)
### Method 1: Regular Expressions
An easy way to perform tokenization is to write some simple regular expressions:
```python
import re

def tokenize(s):
    # s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([,.!?])", r" \1 ", s)     # pad punctuation with spaces
    s = re.sub(r"[^a-zA-Z,.!?]+", r" ", s)  # drop everything except letters and ,.!?
    s = re.sub(r"\s+", r" ", s).strip()     # collapse repeated whitespace
    return s
```
`unicode_to_ascii` is a useful preprocessing step for languages such as French: it converts accented letters to plain ASCII letters where possible.
```python
import unicodedata

def unicode_to_ascii(s):
    # decompose characters (NFD), then drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```
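For example:

```python
print(unicode_to_ascii('déjà vu'))  # -> 'deja vu'
```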
The above tokenization fails in many situations: e.g., `U.K.` will be split into four tokens, `U . K .`, which is not what we want. More advanced tokenizers can handle most of these corner cases.
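The failure is easy to reproduce with the `tokenize` wrapper above:

```python
print(tokenize("The U.K. is in Europe."))
# -> 'The U . K . is in Europe .'
```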
### Method 2: Moses Tokenizer
The `tokenizer.perl` script provided by Moses SMT (Koehn et al.) is a standard tool for tokenization:
```sh
# -a: aggressive hyphen splitting, -no-escape: skip HTML escaping, -l fr: language, -q: quiet
OpenNMT-py/tools/tokenizer.perl -a -no-escape -l fr -q < input.txt > output.atok
```
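If you want to drive it from Python, here is a minimal sketch using `subprocess`; the helper name `moses_tokenize` and the `tokenizer.perl` path are just for illustration:

```python
import subprocess

def moses_tokenize(path_in, path_out, lang='fr'):
    # hypothetical helper: pipe a text file through Moses' tokenizer.perl
    with open(path_in) as f_in, open(path_out, 'w') as f_out:
        subprocess.run(['perl', 'OpenNMT-py/tools/tokenizer.perl',
                        '-a', '-no-escape', '-l', lang, '-q'],
                       stdin=f_in, stdout=f_out, check=True)
```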
### Method 3: spaCy
```python
import spacy

text = u'Apple is looking at buying U.K. startup for $1 billion'
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
# spacy.load('en', disable=['parser', 'tagger', 'ner']) also works
doc = nlp(text)
print(' '.join([token.text for token in doc]))
```
Output: `Apple is looking at buying U.K. startup for $ 1 billion`
- spaCy tokenizes `Tom's` as `Tom 's`, while Moses tokenizes it as `Tom' s`. I think I would prefer spaCy.
- BPE (byte pair encoding) generally can't be applied without tokenization, since tokenization is what separates punctuation from words. Standard BPE implementations (e.g. subword-nmt) expect pre-tokenized input, whereas SentencePiece is designed to work directly on raw text.
- Models for more languages need to be installed, e.g. `python -m spacy download fr`. Then run `spacy.load('fr_core_news_sm')` or simply `spacy.load('fr', disable=['parser', 'tagger', 'ner'])`; see the sketch below.
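A quick sketch of French tokenization after downloading the model; the sentence is just for illustration and the exact output depends on the model version:

```python
import spacy

nlp_fr = spacy.load('fr_core_news_sm', disable=['parser', 'tagger', 'ner'])
doc = nlp_fr(u"L'économie britannique a progressé l'an dernier.")
print(' '.join(token.text for token in doc))
# -> something like: L' économie britannique a progressé l' an dernier .
```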