For text processing, I had never bothered to learn classic Unix tools such as sed and awk, because I could always use Python's regular expression library. The syntax of sed and awk just appeared too arcane to me. However, I recently realized that for many simple ad-hoc tasks, even writing a Python script is too much overhead. This motivated me to learn to use regular expressions directly on the command line.
It also gave me an excuse to play with literate programming in org-mode. I love mixing narratives with code in notebook environments, but with Jupyter and Mathematica notebooks, I have to program in Python, R, or Mathematica. Org-mode has the advantage of supporting many programming languages. Furthermore, the code blocks in a single org-mode notebook can be written in more than one language, which makes it perfect for teaching and learning programming languages. Another useful feature is that instead of loading data from an external file, a code block can read from data blocks (and even from the results of other code blocks) in the same notebook. It's very convenient to be able to manipulate data and code in a self-contained document.
If you are reading this on GitHub or on my blog, note that you are viewing a rendering of the notebook, which hides the markup used to construct the code blocks. On GitHub, you can click on the "Raw" button to view the source code. More usefully, you can download the source code and open it in Emacs. You can then move the cursor to one of the code blocks, and evaluate it with C-c C-c.
Preparations
By default, org-mode in Emacs only allows the evaluation of Emacs Lisp code blocks. To interact with the Unix shell and Python, I added the following to my Emacs configuration script:
;; Do not ask for confirmation when evaluating a block
(setq org-confirm-babel-evaluate nil)
(org-babel-do-load-languages
 'org-babel-load-languages
 '((emacs-lisp . t)
   (shell . t)
   (python . t)))
Running Unix shell commands in org-mode
Let's start with this small chunk of text:
apple orange User24 banana User300 User5s cherry strawberry kiwi User65
To store it in the notebook, I used the #+BEGIN_EXAMPLE and #+END_EXAMPLE markup. I gave it a name (source_text) with #+NAME:, so that other code blocks can refer to it. To learn more about the mechanism for passing information among code blocks, jherrlin has posted several very useful blog entries.
The following is a shell script block. I use the :stdin source_text directive to send the text to the block's standard input (stdin). To read standard input from a shell script, the easiest way is to read from the /dev/stdin file:
cat /dev/stdin
apple orange User24 banana User300 User5s cherry strawberry
The result above shows that we can indeed read from the source_text block. Note that the display format of the result is set on the shell script block itself. I used :results output to tell org-mode to display the raw output of the shell script, instead of formatting it as a table (which is the default behavior). The :exports both directive was used so that when the notebook is exported to HTML, the code and the result are both exported (by default, only the code is exported).
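To try the same thing outside org-mode, a pipe can stand in for the :stdin source_text directive. The sketch below is only an illustration: the variable names (source_text, from_dev_stdin, from_plain_cat) are made up for this example, and the text is written on a single line, which may not match the line breaks of the original block. It assumes a system that provides /dev/stdin (Linux and macOS do).

```shell
# Hypothetical stand-in for the source_text block (written on one line here)
source_text='apple orange User24 banana User300 User5s cherry strawberry kiwi User65'

# Feed the text to standard input through a pipe and read it back
# from the /dev/stdin file, as the org-mode block does.
from_dev_stdin=$(printf '%s\n' "$source_text" | cat /dev/stdin)

# cat with no file argument reads standard input as well.
from_plain_cat=$(printf '%s\n' "$source_text" | cat)

printf '%s\n' "$from_dev_stdin"
```

Both variables end up holding the same text, so whether to spell out /dev/stdin is mostly a matter of taste; naming the file explicitly makes it obvious where the data comes from.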
Simple pattern matching with grep
How do we extract all the user IDs (User24, User300, User5, User65) from the string? In Python, it's just:
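import re

# `str` is assumed to hold the text of the source_text block
# (the name shadows Python's built-in str).
res = re.findall(r'User\d+', str)
for r in res:
    print(r)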
User24
User300
User5
User65
To do the same in the Unix shell, grep is the closest to Python's re.findall(). It took me a couple of minutes to figure out that I had to use the extended version of grep, egrep (equivalent to grep -E), because in grep's default basic regular expressions an unescaped + is an ordinary character rather than a quantifier:
# [0-9] instead of \d, which is not part of POSIX extended regular expressions
cat /dev/stdin | egrep -o 'User[0-9]+'
User24
User300
User5
User65
The -o option (o stands for "only matching") is important, because it tells (e)grep to print only the substrings that match the regular expression pattern. Conveniently, each matched substring is printed on its own line. This behavior will come in handy later.
Without the -o flag, (e)grep prints the entire line if there is a match in the line. This behavior is useful for identifying the lines in a file that match the pattern, but not for extracting pieces of a line.
cat /dev/stdin | egrep 'User[0-9]+'
apple orange User24 banana User300 User5s cherry strawberry kiwi User65
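As a side note on basic versus extended syntax, the following sketch writes the same pattern both ways. It assumes a grep that supports -o (GNU and BSD grep both do); the variable names text, basic, and extended are made up for this example. With plain grep, the + quantifier can be spelled out as [0-9][0-9]* (one digit followed by zero or more digits).

```shell
# Hypothetical copy of the sample text, written on one line
text='apple orange User24 banana User300 User5s cherry strawberry kiwi User65'

# Basic regular expressions (plain grep): repeat the digit class by hand.
basic=$(printf '%s\n' "$text" | grep -o 'User[0-9][0-9]*')

# Extended regular expressions (grep -E, the same syntax egrep uses):
# the + quantifier works unescaped.
extended=$(printf '%s\n' "$text" | grep -E -o 'User[0-9]+')

printf '%s\n' "$basic"
```

The two commands should produce identical output, one user ID per line.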
I also tried to solve this problem with sed, but it didn't seem to work. sed is good at editing words in a document, but it is not good at extraction.
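One possible workaround, sketched here under the assumption that every user ID starts a whitespace-separated word, is to put each word on its own line with tr before handing the text to sed. The variable names text and ids are made up for this example, and the sed options used (-n, -E, a capture group, and the p flag) are the ones explained in the sed section further down.

```shell
# Hypothetical copy of the sample text, written on one line
text='apple orange User24 banana User300 User5s cherry strawberry kiwi User65'

# tr turns every run of spaces into a single newline, so each word gets its own line.
# sed -n suppresses the default output; the p flag prints only the lines
# where the substitution succeeded, i.e. the words that start with a user ID.
ids=$(printf '%s\n' "$text" | tr -s ' ' '\n' | sed -n -E 's/^(User[0-9]+).*$/\1/p')

printf '%s\n' "$ids"
```

This is still not sed on its own: the splitting step does the part that sed is awkward at.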
Simple pattern matching with awk
This was the first time that I used awk. I was hoping for an elegant one-liner (what was I thinking?), but it turned out that for each line ($0), I had to loop through all matches, print each match, and then update $0 to become the rest of the line. Yuck! If I have to do this, I might as well use Python.
{
    while (match($0, "User[0-9]+")) {
        print substr($0, RSTART, RLENGTH);
        $0 = substr($0, RSTART + RLENGTH);
    }
}
User24
User300
User5
User65
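If each user ID is assumed to sit inside its own whitespace-separated word (true for this sample, but an assumption in general), the loop can run over the fields of the line instead of repeatedly trimming $0. The following is only a sketch of that variant, with made-up variable names (text, ids):

```shell
# Hypothetical copy of the sample text, written on one line
text='apple orange User24 banana User300 User5s cherry strawberry kiwi User65'

ids=$(printf '%s\n' "$text" | awk '{
    for (i = 1; i <= NF; i++)            # visit each whitespace-separated field
        if (match($i, /User[0-9]+/))     # match() sets RSTART and RLENGTH on success
            print substr($i, RSTART, RLENGTH)
}')

printf '%s\n' "$ids"
```

A word that contained two IDs would only yield the first one, which is the price of dropping the while loop.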
Substring extraction with grep and sed
Let's make the problem a little harder. Consider this chunk of text:
apple orange User24.txt banana User300s User5.text cherry strawberry kiwi User65.gif banana User31.text
Some of the substrings that begin with User are filenames (e.g., User24.txt). From the filenames, I want to extract the parts before the extensions (e.g., the User24 part of User24.txt). Note that the length of the extension is not constant.
With Python's re.findall(), this can easily be done by creating a capture group in the pattern with parentheses:
import re

# `str` is assumed to hold the text above; \. matches a literal dot
res = re.findall(r'(User\d+)\.[a-z]{3,4}', str)
for r in res:
    print(r)
User24
User5
User65
User31
With grep, it's easy to pick up the filenames…
cat /dev/stdin | egrep -o 'User[0-9]+\.[a-z]{3,4}'
User24.txt
User5.text
User65.gif
User31.text
… but it is not easy to take them apart. That's where sed comes in! I couldn't solve this problem with sed alone, because sed is not good at picking up multiple matches in the same line. But since grep very helpfully puts each match on its own line, its output is perfect for sed. It's time to try the famous sed command s/ ("substitute"):
sedre='(User[0-9]+)\.[a-z]{3,4}'
action='\1'
cat /dev/stdin | egrep -o "$sedre" | sed -n -E "s/$sedre/$action/p"
User24
User5
User65
User31
It's a nice one-liner, but there are some details to unpack:
- The syntax of the s/ command is s/pattern/action/options. I defined two variables (sedre and action) to make this structure more obvious, but it wasn't necessary.
- I asked sed to print only the lines that matched. This isn't necessary because every line from egrep should match, but it's useful for debugging. This is done with the -n flag, and with the p ("print") option in the last part of the s/ command.
- I turned on the -E flag, to use extended regular expressions.
- I created a capture group in the pattern sedre with parentheses. Normally, the parentheses need to be escaped (i.e., \( and \)), but with the -E flag on, they shouldn't be escaped.
- In the action part of the s/ command, I used \1 to refer to the first (and only) group in the pattern.
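To make the point about escaping concrete, here is a sketch that runs the same substitution twice, once in sed's default basic syntax and once with -E. The input is a made-up list of filenames standing in for the egrep output, and the variable names (names, basic, extended) are hypothetical.

```shell
# Hypothetical stand-in for the egrep output: one filename per line
names=$(printf '%s\n' 'User24.txt' 'User5.text' 'User65.gif' 'User31.text')

# Basic syntax: group parentheses and interval braces are escaped,
# and + is spelled out as [0-9][0-9]*.
basic=$(printf '%s\n' "$names" | sed -n 's/\(User[0-9][0-9]*\)\.[a-z]\{3,4\}/\1/p')

# Extended syntax (-E): parentheses, braces, and + are written unescaped.
extended=$(printf '%s\n' "$names" | sed -n -E 's/(User[0-9]+)\.[a-z]{3,4}/\1/p')

printf '%s\n' "$extended"
```

Both commands should print the same four names without their extensions; the -E version is simply easier to read.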
I was planning to solve this with awk as well, because awk is supposed to be good at extracting bits and pieces of information from text. But awk is really designed for processing tabular data. For unstructured text, awk is worse than Python, so I decided not to bother.
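For comparison only, here is a sketch of what an awk attempt might look like. It is not part of the original notebook, it assumes that each filename is a whole whitespace-separated word, and the variable names (text, stems) are made up. The extension length is written as [a-z][a-z][a-z][a-z]? rather than {3,4}, because not every awk supports interval expressions.

```shell
# Hypothetical copy of the sample text, written on one line
text='apple orange User24.txt banana User300s User5.text cherry strawberry kiwi User65.gif banana User31.text'

stems=$(printf '%s\n' "$text" | awk '{
    for (i = 1; i <= NF; i++)
        # keep only fields that look like UserNN.ext with a 3 or 4 letter extension
        if ($i ~ /^User[0-9]+\.[a-z][a-z][a-z][a-z]?$/) {
            sub(/\.[a-z]+$/, "", $i)     # drop the extension
            print $i
        }
}')

printf '%s\n' "$stems"
```

It works for this sample, but the explicit field loop is noticeably longer than the grep and sed pipeline.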