Tokenization
Published on 09 Sep 2018
Tokenization is a standard preprocessing step in NLP where a sentence is split into its constituent tokens, such as words, numbers, and punctuation. Tools such as Moses and spaCy can do it, although they seem to support only a handful of (mostly European) languages. (For languages such as Hindi, there appears to be some code on GitHub for this; I need to look it up later.)

### Method 1: Regular Expressions

An easy way to perform tokenization is to write a few simple regular expressions:

```python
import re

def normalize_string(s):
    # s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([,.!?])", r" \1 ", s)      # put spaces around punctuation
    s = re.sub(r"[^a-zA-Z,.!?]+", r" ", s)   # drop everything else
    s = re.sub(r"\s+", r" ", s).strip()      # collapse whitespace
    return s
```

unicode_to_ascii is a useful preprocessing step for languages such as French. It converts accented letters to plain ASCII letters (where possible):

```python
import unicodedata

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```

The above tokenization fails in many situations: e.g. U.K. will be split into the four tokens U . K ., which is not what we want. More advanced tokenizers handle most of these corner cases.

### Method 2: Moses Tokenizer

The tokenizer.perl script provided by Moses SMT (Koehn et al.) is a standard tool for tokenization:

```bash
OpenNMT-py/tools/tokenizer.perl -a -no-escape -l fr -q < input.txt > output.atok
```

### Method 3: spaCy

```python
import spacy

text = u'Apple is looking at buying U.K. startup for $1 billion'
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
# spacy.load('en', disable=['parser', 'tagger', 'ner']) is also fine
doc = nlp(text)
print(' '.join([token.text for token in doc]))
```

Output:

```
Apple is looking at buying U.K. startup for $ 1 billion
```

### Remarks

- spaCy tokenizes Tom's as Tom 's, while Moses tokenizes it as Tom' s. I think I prefer spaCy's behaviour.
- BPE (byte pair encoding) probably can't be applied without tokenization, since tokenization is needed to separate punctuation. BPE/sentencepiece may or may not do this.
- Models for other languages need to be installed separately, e.g. `python -m spacy download fr`. Then run `spacy.load('fr_core_news_sm')`, or simply `spacy.load('fr', disable=['parser', 'tagger', 'ner'])`.
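As a closing illustration, the U.K. failure mode of Method 1 is easy to reproduce with the standard library alone, reusing the article's normalize_string and unicode_to_ascii helpers (the example sentence is just an arbitrary choice):

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # strip combining marks after NFD decomposition, e.g. 'déjà vu' -> 'deja vu'
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalize_string(s):
    # the regex tokenizer from Method 1
    s = re.sub(r"([,.!?])", r" \1 ", s)
    s = re.sub(r"[^a-zA-Z,.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

print(unicode_to_ascii("déjà vu"))
# -> deja vu

print(normalize_string("Apple is buying U.K. startup!"))
# the abbreviation is broken into separate tokens:
# -> Apple is buying U . K . startup !
```

A tokenizer like spaCy keeps U.K. as a single token because its rules include exception lists for common abbreviations, which plain regexes don't capture.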