Command Line Tools for Natural Language Processing
Published on 11 Oct 2018
As the saying goes, *don't use a sledgehammer to crack a nut*. Many tasks in natural language processing can be accomplished with simple Unix tools such as sed and awk.

### 1. Randomly Pick Lines from a Dataset

Say we have a large dataset of 1 million lines. How do we pick half a million lines at random from that file?

    shuf -n 500000 file.txt

### 2. Combine Two Files Side by Side

Join two files line by line into a single file, with the columns separated by a tab.

    paste train.en train.hi > train.pair

Note: To concatenate files vertically, use the *cat* command.

### 3. Split Parallel Corpus to Two Files

This is the opposite of the task above: separate a single file with multiple tab-separated columns into one file per column.

    cut -f1 train.pair > train.en
    cut -f2 train.pair > train.hi

Note: To get a substring from a string, use *cut -c*. We can also split a string on a delimiter using the *-d* flag. For example,

    echo 'ml-segmented/42_Mann_Ki_Baat_September_2015.ml' | cut -d/ -f2 | cut -c1-2

outputs the number 42.

### 4. Split into Train and Dev sets

Usually the validation set (a.k.a. development set) is a subset of the training data, set aside for validation purposes such as hyper-parameter selection. Say we want to set aside 4% of the train set as the dev set. Since 4% is 1 line in 25, we send every 25th line to the dev set and keep the rest for training:

    awk '{ if (NR % 25 == 0) print $0 }' train.en-hi > valid.en
    awk '{ if (NR % 25 != 0) print $0 }' train.en-hi > train.en

### 5. Sort Lines by Fields

Say we want to sort the lines in a file by the second column, which is a numeric value. (Without the *-n* option, sort compares lexicographically!)

    sort -n -k 2 filename

Use the *-r* option to sort in reverse order.

### 6. Paste grep outputs

Say we want to combine the outputs of two different grep commands side by side into a single file.

    paste <(grep '^H' train.gen.out) <(grep '^T' train.gen.out) | sort -n -k 2 > outfile

In the above example we use *process substitution* to place the lines starting with H and the lines starting with T side by side, then sort the result by the second column.

### 7. Pick Lines up to N

From a large file, extract the first N lines (here N = 1000) to another file.

    awk 'NR <= 1000' file.txt > outfile.txt

### 8. Extract a particular pattern from a line

Say each line in train.perpl has the following format and we want to extract only the Total values:

    I=28 2 -1.025943 am=1083 3 -0.9873719 Binu=0 1 -7.021167 Jasim=0 1 -5.8004203 .=2001 1 -3.8583121 =2 2 -0.23982812 Total: -18.933043 OOV: 2

    grep -o 'Total: [^ ]* ' train.perpl

*-o* stands for *only matching*: grep prints just the part of the line that matches the pattern instead of the whole line.

### 9. Convert lower case to upper case

    tr '[a-z]' '[A-Z]' < file.txt

To go the other way and convert upper case to lower case, for example to normalize a corpus:

    tr '[:upper:]' '[:lower:]' < file.en > file.norm.en

### 10. Squeeze repetition of characters

    echo "Welcome    To    Unix" | tr -s '[:space:]' ' '
    # Welcome To Unix

### 11. Remove all the digits from the string

    echo "my ID is 73535" | tr -d '[:digit:]'

Note: Adding the *-c* (complement) flag deletes everything except the digits, so

    echo "my ID is 73535" | tr -cd '[:digit:]'

gives the complement of the above: 73535

Note: The tr (translate) tool is quite handy. See [GeeksforGeeks](https://www.geeksforgeeks.org/tr-command-unixlinux-examples/) for more uses of *tr*.

### 12. Count number of lines

    wc -l filename.txt

This counts newline characters, so it reports one less than the actual number of lines if the last line of the file does not end with a newline. Hence confirm the line count with

    awk 'END {print NR}' filename.txt

With GNU sed, add a newline at the end of the file if it is missing using

    sed -i -e '$a\' filename.txt

### 13. Count Total Number of Words in a Directory

Let's add up the total number of words in all the files in a directory.

    num=0
    for f in *; do
      ((num += $(wc -w "$f" | awk '{print $1}')))
    done
    echo "$num"

### 14. Add a column to a file

Let's say we want to add the line number as the first column of a file:

    awk '{print NR,$0}' filename.txt

Note that *$0* refers to the entire line while *$1, $2* etc. refer to the first column, second column etc.

Using sed, we can prepend a fixed prefix to every line, such as a language tag built from the variable *$lang*:

    prefix="__$lang"; prefix+="__"
    sed -i "s/^/$prefix /" file.txt

Note that the double quotes are necessary so that the shell expands the variable. *-i* edits the file in place.

### 15. Reverse Words in a Sentence

Reversing the order of words can be useful, for example to put Urdu words in left-to-right order before training a machine translation model. How do we reverse the order of words in each line?

    awk '{for(i=NF;i>=1;i--) printf "%s ",$i; print ""}' input.txt

Note: *NF* denotes the number of fields (words) in the line.

### 16. Rename Multiple Files

Let's say we want to rename all files in the directory by inserting *_wat* before the file extension. Here is a bash loop to do it:

    for f in *; do
      mv "$f" "$(echo "$f" | sed 's/\.\([^.]*\)$/_wat.\1/')"
    done

The sed expression matches only the last dot in the name, so a file with several dots gets *_wat* inserted once, just before its extension. Backquotes would also work for the command substitution, but *$(...)* can be nested, which helps in more complicated scripts.

### 17. Recursively search for files with a specific extension

    find "$DIR" -type f -name "*.py"

Recursively copy:

    find . -name "*.bpe.*" -exec cp {} to_dir/ \;

### 18. Paste Outputs of Two commands

Take the first four characters of each line in two files and place them side by side:

    paste <(cut -c1-4 l1.txt) <(cut -c1-4 l2.txt) > lol.txt

### 19. Replace a String with a given String

In bash, parameter expansion replaces the first match of *pattern* in the value of *parameter* with *string*:

    ${parameter/pattern/string}

Note: To replace all occurrences, use

    ${parameter//pattern/string}
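
The difference between the single-slash and double-slash forms is easiest to see on a concrete string. The following is a minimal sketch, assuming bash (this expansion is not part of POSIX sh); the variable names and the sample string are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample: a token sequence joined by underscores.
tokens="natural_language_processing"

# Single slash: only the first underscore is replaced.
first_only="${tokens/_/ }"

# Double slash: every underscore is replaced.
all_matches="${tokens//_/ }"

echo "$first_only"    # natural language_processing
echo "$all_matches"   # natural language processing
```

Because the substitution happens inside the shell, no external process such as sed is started, which can matter when the replacement runs inside a loop over many strings.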
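
Several of the earlier recipes chain together naturally. The sketch below is one possible way to combine them, not a fixed procedure: it assumes a made-up 50-line corpus generated with *seq* in a scratch directory, and the file names dev.pair and train.split.pair are hypothetical. It joins the two sides with paste (section 2), holds out every 25th pair with awk (section 4), and splits the held-out pairs back into one file per language with cut (section 3).

```shell
# Scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)

# Made-up parallel corpus: 50 aligned lines per language.
seq 1 50 | sed 's/^/en sentence /' > "$workdir/train.en"
seq 1 50 | sed 's/^/hi sentence /' > "$workdir/train.hi"

# Section 2: join both sides into one tab-separated file.
paste "$workdir/train.en" "$workdir/train.hi" > "$workdir/train.pair"

# Section 4: every 25th pair (4%) goes to dev, the rest stays in train.
awk 'NR % 25 == 0' "$workdir/train.pair" > "$workdir/dev.pair"
awk 'NR % 25 != 0' "$workdir/train.pair" > "$workdir/train.split.pair"

# Section 3: split the dev pairs back into one file per language.
cut -f1 "$workdir/dev.pair" > "$workdir/dev.en"
cut -f2 "$workdir/dev.pair" > "$workdir/dev.hi"

# Section 12: count lines with awk to check the split sizes.
dev_lines=$(awk 'END {print NR}' "$workdir/dev.pair")
train_lines=$(awk 'END {print NR}' "$workdir/train.split.pair")
first_dev_hi=$(head -n 1 "$workdir/dev.hi")

echo "dev=$dev_lines train=$train_lines"   # dev=2 train=48
echo "$first_dev_hi"                       # hi sentence 25

rm -rf "$workdir"
```

Splitting the paired file rather than each language file separately keeps the two sides aligned, since both columns of a line always land in the same set.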