Command Line Tools for Natural Language Processing
Published on 11 Oct 2018
As the saying goes, *don't use a sledgehammer to crack a nut*: many tasks in natural language processing can be accomplished with simple Unix tools such as `sed` and `awk`.

### 1. Randomly Pick Lines from a Dataset

We have a large dataset of, say, 1 million lines. How do we pick half a million lines at random from that file?

`shuf -n 500000 file.txt`

### 2. Combine Two Files Side by Side

Join two files line by line into a single file, with the columns separated by a tab.

`paste train.en train.hi > train.pair`

Note: To concatenate files vertically, use the `cat` command.

### 3. Split Parallel Corpus to Two Files

This is the opposite of the task above: separate a single file with multiple tab-separated columns into one file per column.

    cut -f1 train.pair > train.en
    cut -f2 train.pair > train.hi

Note: to extract a substring by character position, use `cut -c`. We can also split a string on a delimiter with the `-d` flag. For example, `echo 'ml-segmented/42.txt' | cut -d/ -f2 | cut -c1-2` takes the second `/`-separated field and prints its first two characters, `42` (the file name here is illustrative).

### 4. Split into Train and Dev sets

Usually the validation set (a.k.a. development set) is a subset of the training data, set aside for validation purposes such as hyper-parameter selection. Say we want to set aside 4% of the train set as the dev set. Every 25th line goes to the dev set and the rest stay in the train set:

    awk '{if (NR%25 == 0) print $0; }' train.en-hi > valid.en
    awk '{if (NR%25 != 0) print $0; }' train.en-hi > train.en

### 5. Sort Lines by Fields

Say we want to sort the lines in a file by the second column, which holds a numeric value.

`sort -n -k 2 filename`

Without the *-n* option, `sort` compares lexicographically. Use the *-r* option to sort in reverse order.

### 6. Paste grep outputs

Say we want to combine the outputs of two different grep commands side by side into a single file.

`paste <(grep ^H train.gen.out) <(grep ^T train.gen.out) | sort -n -k 2 > outfile`

In the above example we use *process substitution* to place the lines starting with H and the lines starting with T side by side, then sort the result by the second column.

### 7. Pick lines up to N

From a large file, extract the first N lines (here N = 1000) to another file.

`awk 'NR <= 1000' file.txt > outfile.txt`

### 8. Extract a particular pattern from a line

Say each line in train.perpl has the following format and we want to extract only the Total values:

    I=28 2 -1.025943 am=1083 3 -0.9873719 Binu=0 1 -7.021167 Jasim=0 1 -5.8004203 .=2001 1 -3.8583121 =2 2 -0.23982812 Total: -18.933043 OOV: 2

`grep -o 'Total: [^ ]* ' train.perpl`

*-o* stands for *only matching*: grep prints only the part of each line that matches the pattern.

### 9. Convert lower case to upper case

`tr '[a-z]' '[A-Z]' < file.txt`

For the opposite direction (upper case to lower case), using character classes:

`tr '[:upper:]' '[:lower:]' < file.en > file.norm.en`

### 10. Squeeze repetition of characters

    echo "Welcome    To    Unix" | tr -s '[:space:]' ' '
    # Welcome To Unix

### 11. Remove all the digits from the string

`echo "my ID is 73535" | tr -d '[:digit:]'`

Note: `echo "my ID is 73535" | tr -cd '[:digit:]'` gives the complement of the above: `73535`

Note: The `tr` (translate) tool is quite handy; see Geek4Geek for more uses of *tr*.

### 12. Count number of lines

`wc -l filename.txt` can report one fewer than the actual number of lines if the last line of the file does not end with a newline character. To confirm the count, use

`awk 'END {print NR}' filename.txt`

Add a newline at the end of the file (if it is missing) with GNU `sed`:

`sed -i -e '$a\' filename.txt`

### 13. Count Total Number of Words in a Directory

Let's add up the total number of words in all the text files in a directory:

    num=0
    for f in *; do ((num += $(wc -w < "$f"))); done
    echo $num

### 14. Add a column to a file

Let's say we want to add the line number as the first column of a file:

    awk '{print NR, $0}' filename.txt

Note that *$0* refers to the entire line, while *$1*, *$2*, etc. refer to the first column, second column, etc.

Using `sed` to add a fixed prefix instead:

    prefix="__$lang"; prefix+="__"
    sed -i "s/^/$prefix /" file.txt

Note that the double quotes are necessary so that the shell expands the variable. *-i* edits the file in place.

### 15. Reverse Words in a Sentence

Reversing the word order can be useful, for example, to put the words of a right-to-left language such as Urdu in left-to-right order before training a machine translation model. How do we reverse the order of words?

`awk '{for(i=NF;i>=1;i--) printf "%s ", $i; print ""}' input.txt`

Note that *NF* denotes the number of fields (words) in the current line.

### 16. Rename Multiple Files

Let's say we want to rename all files in the directory by inserting *_wat* before the file extension. Here is a bash loop to do it:

    for f in *; do mv "$f" "$(echo "$f" | sed 's/\.[^.]*$/_wat&/')"; done

The pattern matches only the last dot and the extension after it, so names with several dots are changed in one place only. Using backquotes instead of `$(...)` to evaluate the echo command also works, but `$(...)` can be nested in more complicated scripts.

### 17. Recursively search for files with a specific extension

`find "$DIR" -type f -name "*.py"`

Recursively copy:

`find . -name "*.bpe.*" -exec cp {} to_dir/ \;`

### 18. Paste Outputs of Two commands

`paste <(cut -c1-4 l1.txt) <(cut -c1-4 l2.txt) > lol.txt`

### 19. Replace a String with a given String

In bash, parameter expansion replaces the first match of *pattern* in the value of *parameter* with *string*:

`${parameter/pattern/string}`

Note: To replace all occurrences, use `${parameter//pattern/string}`
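
As a minimal sketch of the two substitution forms from the section on replacing a string, the following bash snippet shows the difference between replacing the first match and replacing every match. The variable names `path`, `first`, and `all`, and the sample value, are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample value; not taken from the original post.
path="train.en.en"

# Single slash: replace only the first occurrence of "en".
first="${path/en/hi}"

# Double slash: replace every occurrence of "en".
all="${path//en/hi}"

echo "$first"
echo "$all"
```

With this sample value, the first form should leave the second `en` untouched, while the second form changes both. Note that this kind of expansion is a bash feature and may not be available in a plain POSIX `sh`.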
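
For the train/dev split described in the section on train and dev sets, a quick sanity check is to confirm that the two output files together account for every input line. The sketch below assumes a small synthetic corpus of 100 numbered lines in a temporary directory; the file names `train.all`, `valid.txt`, and `train.txt` are hypothetical stand-ins, not the files from the post.

```shell
# Build a small hypothetical corpus of 100 numbered lines in a temp directory.
tmpdir=$(mktemp -d)
seq 1 100 > "$tmpdir/train.all"

# Every 25th line goes to the dev set; all other lines stay in the train set.
awk 'NR % 25 == 0' "$tmpdir/train.all" > "$tmpdir/valid.txt"
awk 'NR % 25 != 0' "$tmpdir/train.all" > "$tmpdir/train.txt"

# Count lines with awk, which also counts a final line that lacks a newline.
total=$(awk 'END {print NR}' "$tmpdir/train.all")
n_dev=$(awk 'END {print NR}' "$tmpdir/valid.txt")
n_train=$(awk 'END {print NR}' "$tmpdir/train.txt")

echo "total=$total dev=$n_dev train=$n_train"

# Clean up the temporary files.
rm -rf "$tmpdir"
```

With 100 input lines, lines 25, 50, 75, and 100 land in the dev set, so the counts should come out as 4 and 96. Because the two `awk` conditions are exact complements, no line can be lost or duplicated.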
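
Before renaming files in bulk as in the section on renaming multiple files, it can help to preview the new names without moving anything. The sketch below uses a hypothetical helper function `new_name` and a `sed` expression anchored to the last dot; treating only the text after the last dot as the extension is an assumption, and the sample file names are made up for illustration.

```shell
# Hypothetical helper: print the new name for a file without renaming it.
# The sed pattern matches the last dot plus the extension and
# re-inserts the match (&) after the "_wat" suffix.
new_name() {
  printf '%s\n' "$1" | sed 's/\.[^.]*$/_wat&/'
}

# Dry run over a few made-up names.
for f in train.en corpus.tok.hi README; do
  echo "$f -> $(new_name "$f")"
done
```

A name without any dot, such as `README`, does not match the pattern and is left unchanged. Once the preview looks right, the `echo` line can be swapped for `mv "$f" "$(new_name "$f")"`.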