Pulling out strings

Posted by waldner on 22 January 2014, 1:23 pm

This is a generic text-processing need that often occurs in different kinds of scripts. Simply put, you want to get a list of the strings in the file (or files) that match a certain pattern. Let's use this simple file as an example:

12345#foobar3#blah
xxxxxxx#foobar77#yyyyyy
foobar867#zzzzzzz
ooooooo#foobar12#ggggggg#foobar17#kkkkkkkk#foobar99
xxxxxxxxxxxxxxxxx
somefoobar12thatwedontwant

Our pattern is (using ERE syntax) "foobar[0-9]+", that is, "foobar" followed by one or more digits. We will refine it a bit later.

Using common shell tools, we have several possibilities.

GNU grep

Probably the simplest one, if GNU grep is available, is to use its -o option, to return only the part of the input that matches the pattern, so:

$ grep -Eo 'foobar[0-9]+' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

As said, this relies on the -o option, which is not specified by POSIX, hence the requirement for GNU grep (or another grep that implements it).

GNU awk and BusyBox awk

These two awk implementations support, as a non-standard extension, the assignment of a regular expression to RS, and make whatever matched RS available in the special variable RT (mawk seems to support the former feature, but not the latter, which makes it unsuitable for use in the way we describe here). So here's how to use these awks for the task:

$ gawk -v RS='foobar[0-9]+' 'RT{print RT}' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Note that using RS/RT this way makes it possible to match patterns that contain newlines, something that's not easily achieved with other tools (except Perl, see below).

These methods are easy and quick; however, if none of the above implementations is available, we need to use something more standard.

Standard awk

With standard awk, a way to extract all occurrences is to use a loop over each line, repeatedly using match():

$ cat matches.awk
{
  line = $0
  while (match(line, /foobar[0-9]+/) > 0) {
    print substr(line, RSTART, RLENGTH)
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches.awk test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Here the original line is saved (in case it's needed for further processing) and a copy is used to find matches. Since match() only finds the first match in the string, when a match is found it's removed so running match() again can find the following occurrence (if any). For this reason, the above code will loop forever if it's given a pattern that can match the empty string, like for example a*. When you do that, you really want a+ instead anyway, so use the latter. The code above is a common awk idiom to find all matches of a pattern.
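If the pattern cannot be guaranteed to match at least one character, the loop can be guarded so that it always advances. The following is a minimal sketch rather than part of the original recipe: extract_all is a hypothetical helper name, the pattern is passed as a dynamic regex string in the awk variable re (also an assumption), and an empty match is handled by dropping one character before trying again.

```shell
# Hypothetical helper: print every non-empty match of the ERE given in $1,
# reading lines from stdin. The name and interface are assumptions.
extract_all() {
  awk -v re="$1" '
  {
    line = $0
    # Stop when nothing is left, so an empty-matching pattern cannot spin
    while (line != "" && match(line, re) > 0) {
      if (RLENGTH > 0) {
        print substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)
      } else {
        # Empty match: drop one character so the loop always advances
        line = substr(line, RSTART + 1)
      }
    }
  }'
}

# a* can match the empty string, yet the loop terminates
result=$(printf '%s\n' 'baaab' 'xyz' 'aa#a' | extract_all 'a*')
printf '%s\n' "$result"
```

With this input the sketch should print aaa, aa and a, one per line. Keep in mind that awk processes backslash escapes in values assigned with -v, so patterns containing backslashes need extra care.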

Sed

With sed the task is a bit more complicated. Basically, we need to somehow "mark" the parts of the data that match our pattern, so we can later delete everything that's not between markers, thus leaving only what we want.

A safe character to use as marker is the newline character (\n), since sed guarantees that, under normal conditions, no input line as seen in the pattern space will contain that character. For the first of the following solutions to work, a sed implementation that recognizes \n in the RHS and the special bracket expression [^\n] (any character except \n) is needed. And since our pattern is an ERE (though it could be rewritten as a BRE), we need a sed that recognizes EREs. GNU sed has all these features, and we're going to assume it in the examples.

That said, let's see a couple of ways to solve the task with sed.

One somewhat laborious solution is as follows:

$ sed -E '
s/foobar[0-9]+/\n&/g
t ok
d
:ok
s/^[^\n]*\n//
s/(foobar[0-9]+)[^\n]*/\1/g' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Here we prepend a \n to each match, then delete what's before the very first match in the line (zero or more non-\n followed by a \n at the beginning of the string). Finally we delete all the parts between matches, which leaves us with just the matches, nicely separated by \n characters.

Another approach to the problem is implemented with the following code (which also has the benefit of using standard syntax; changing the ERE into BRE (foobar[0-9][0-9]*) and converting all the "\n" in the RHS to literal escaped newlines would allow this solution to be used with a standard sed):

$ sed -E '
/\n/!s/foobar[0-9]+/\n&\n/g
/^foobar[0-9]+\n/P
D' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Here the approach is to "isolate" each match with a \n before and one after (if the pattern space doesn't already have one). If the line begins with a match, it's printed with "P" (up to the following \n, which is what we want). Regardless, the part up to and including the first \n is deleted (with "D"). If something is left, go to the beginning to do the previous steps again, until the whole pattern space is entirely consumed. If there were no matches in the original line, "D" will just delete it entirely and start a new cycle. Rinse and repeat for every input line.
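The standard-syntax variant mentioned above can be sketched as follows. This is only an illustration under the stated assumptions: the ERE is rewritten as the BRE foobar[0-9][0-9]*, each \n in the replacement becomes a backslash followed by a literal newline, and the sample data is written to a temporary file (the tmp and result variable names are ours).

```shell
# Sample input from the article, written to a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
12345#foobar3#blah
xxxxxxx#foobar77#yyyyyy
foobar867#zzzzzzz
ooooooo#foobar12#ggggggg#foobar17#kkkkkkkk#foobar99
xxxxxxxxxxxxxxxxx
somefoobar12thatwedontwant
EOF

# BRE instead of ERE, and backslash-newline instead of \n in the replacement
result=$(sed '
/\n/!s/foobar[0-9][0-9]*/\
&\
/g
/^foobar[0-9][0-9]*\n/P
D' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The output should be the same seven lines as with the -E version. Only the replacement side needs the literal newlines; \n inside the regular expressions is kept as it is.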

Perl

With perl we can do it pretty easily thanks to its powerful regular expression matching operators:

$ perl -ne 'print "$_\n" for (/foobar\d+/g);' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

If the pattern we want has newlines in it, we can just tell perl to slurp the whole file with perl -0777 -ne and we're set (perl -00 -ne, paragraph mode, is enough when no match spans a blank line).

Context comes to town

All the solutions seen so far strictly match a pattern, regardless of where it appears. In other words, they ignore the context of the matches. However, there may be cases where this is important. In our example input data, we might want to match foobar[0-9]+ only if it's delimited, where "delimited" here is defined as "preceded by either a hash (#) or beginning of line, and followed by either a hash or end of line". Clearly, with these new requirements we don't want the foobar12 in the last line.

We thus need to consider the context in the regular expressions, making them cover a larger portion of text, so that matches only happen where there's data that we want; however, since the matched text will now be larger than what we need, we must subsequently "clean up" the match, extracting only what we want from it. Our regular expression now becomes (ERE syntax)

(^|#)foobar[0-9]+(#|$)

Let's see how to modify the previous solutions to work with context.

GNU grep

Grep can't really edit text, so it would seem like it's out of the discussion here, but with a silly trick we can still use it:

$ grep -Eo '(^|#)foobar[0-9]+(#|$)' test.txt | grep -Eo 'foobar[0-9]+'
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

The first grep prints all matches with their context, and the second one, operating only on the good data, strictly "extracts" the matches that we need.

GNU awk and BusyBox awk

Setting RS to a non-default value obviously causes awk to stop working in line-oriented mode, so the beginning-of-line and end-of-line anchors in our regular expression need to be augmented to consider the newline character.

Now, with the extended RS, RT will contain the full match with context, so we use gsub() to clean it up:

$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

The critical part here is obviously the gsub(), which should be written carefully to remove the context stuff and only leave what we want.

Standard awk

Here we don't change RS so we're using the traditional line-oriented mode:

$ cat matches2.awk
{
  line = $0
  while (match(line, /(^|#)foobar[0-9]+(#|$)/)>0) {
    m = substr(line, RSTART, RLENGTH)
    gsub(/^#|#$/, "", m); print m
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches2.awk test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

Sed

Things start to get complicated with sed if we want context. However we can still do it.

Of the two sed solutions presented previously, the easiest to adapt is the second one, so here it is:

$ sed -E '
/\n/!s/(^|#)foobar[0-9]+(#|$)/\n&\n/g
/^#?foobar[0-9]+#?\n/ {
  s/^#?(foobar[0-9]+)#?/\1/
  P
}
D' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

Again, the critical bit is the part where the context (that we needed to match only the "correct" parts, but no longer want) is removed. This part will be highly dependent on the actual input data and problem requirements.

Perl

Perl is again an easy winner, as we can match with context and pull out only the interesting parts in a single go:

$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?:#|$)/g);' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

The regular expressions for what comes before and after are non-capturing, so the list returned by the overall match is already made of clean strings, which we thus just need to print.

Overlap problems

You might have noticed that at the same time we introduced context to the matches, we also introduced the potential for overlap. Consider the following sample input data:

12345#foobar3#foobar9999#blah
somefoobar12thatwedontwant

If we run for example the above GNU awk solution on this data, we get:

$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test2.txt
foobar3

The foobar9999 is missed since the regular expression that matches foobar3 also "consumes" its surrounding context (the leading and trailing hash) and thus applying the regex with context again on what's left fails to match the second occurrence of the pattern.

However, this does not happen with all the solutions, only with some of them. The standard awk and the sed solutions still work, since the previous match is deleted from the line and the extended pattern we use to include context also matches at the beginning of a line, without a leading delimiter. In the example, once #foobar3# has been matched and removed, what's left is "^foobar9999#blah$", and the expression we're using can still match it again, since the pattern is now at the very beginning and ^ is a possible anchor.
Of course, this happens to work because of the specific combination of input data and regular expressions that we're using; generally speaking, this doesn't have to be the case. It will depend on the actual situation.
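For this particular definition of "delimited", one way to make the standard awk loop less dependent on that coincidence is to stop consuming the trailing delimiter. The following is a sketch under assumptions of our own, not the article's original code: the line is padded with a hash on both sides so that beginning and end of line behave like delimiters, and after each match the scan resumes at the trailing hash instead of after it. extract_delimited is a hypothetical name.

```shell
# Hypothetical helper: print foobar[0-9]+ only when delimited by "#" or
# by the beginning/end of the line. Reads stdin or the files given.
extract_delimited() {
  awk '
  {
    # Pad with delimiters so start and end of line behave like a hash
    line = "#" $0 "#"
    while (match(line, /#foobar[0-9]+#/) > 0) {
      # Strip the one-character delimiter on each side
      print substr(line, RSTART + 1, RLENGTH - 2)
      # Resume at the trailing hash, so it can open the next match
      line = substr(line, RSTART + RLENGTH - 1)
    }
  }' "$@"
}

result=$(printf '%s\n' '12345#foobar3#foobar9999#blah' 'somefoobar12thatwedontwant' |
  extract_delimited)
printf '%s\n' "$result"
```

On the overlap sample this should print foobar3 and foobar9999. The trick only works because the delimiter is a single known character; contexts of variable length need a different cleanup step.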

The modern RE engine answer to safely solve the overlapping context problem is, naturally, lookaround, which turns actual consumed characters into zero-length assertions, and leaves them available for the next match attempt. This means that sed and awk are excluded, since their RE engines do not support lookaround.

What's left is GNU grep (with its -P option to match in PCRE mode, where available), and of course perl.

grep:

$ grep -Po '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt
foobar3
foobar9999

There's also a pcregrep utility that comes with the PCRE library, with a syntax similar to that of grep. In particular, it supports the -o option, so we can also do:

$ pcregrep -o '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt
foobar3
foobar9999

Let's try perl:

$ perl -ne 'print "$_\n" for (/(?<=^|#)(foobar\d+)(?=#|$)/g);' test2.txt
Variable length lookbehind not implemented in regex m/(?<=^|#)(foobar\d+)(?=#|$)/ at -e line 1.

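Oops... it seems PCRE is more advanced than perl itself in this particular feature (perl 5.30 and later do accept limited variable-length lookbehind, but only as an experimental feature). As man pcrepattern informs us,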

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length. Thus

(?<=bullock|donkey)

is permitted, but

(?<!dogs?|cats?)

causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl, which requires all branches to match the same length of string. An assertion such as

(?<=ab(c|de))

is not permitted, because its single top-level branch can match two different lengths, but it is acceptable to PCRE if rewritten to use two top-level branches:

(?<=abc|abde)

So what can we do with perl? We have two possibilities.

We note that, strictly speaking, and in this particular case, only what follows the match has to be preserved for the next attempt; the lookbehind is not strictly needed, and we can replace it with a regular match. Thus:

$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?=#|$)/g);' test2.txt
foobar3
foobar9999

Another way to solve the problem is a bit ugly, but it works: we can just move the ^ anchor outside the lookbehind and make it part of a regular alternation; since it's a zero-length match anyway, nothing is harmed:

$ perl -ne 'print "$_\n" for (/(?:^|(?<=#))(foobar\d+)(?=#|$)/g);' test2.txt
foobar3
foobar9999

It is important to understand that there's no generic rule here, and the solution will necessarily have to depend on the problem at hand. Depending on the actual situation, transforming a variable-length lookbehind into something accepted by perl may not always be so easy (or even possible).

