This is a generic text-processing need that often occurs in different kinds of scripts. Simply put, you want to get a list of the strings in the file (or files) that match a certain pattern. Let's use this simple file as an example:
12345#foobar3#blah
xxxxxxx#foobar77#yyyyyy
foobar867#zzzzzzz
ooooooo#foobar12#ggggggg#foobar17#kkkkkkkk#foobar99
xxxxxxxxxxxxxxxxx
somefoobar12thatwedontwant
Our pattern is (using ERE syntax) "foobar[0-9]+", that is, "foobar" followed by one or more digits. We will refine it a bit later.
Using common shell tools, we have several possibilities.
GNU grep
Probably the simplest one, if GNU grep is available, is to use its -o option, to return only the part of the input that matches the pattern, so:
$ grep -Eo 'foobar[0-9]+' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12
As said, this needs GNU grep due to the -o option.
GNU awk and BusyBox awk
These two awk implementations support, as a non-standard extension, the assignment of a regular expression to RS, and make whatever matched RS available in the special variable RT (mawk seems to support the former feature, but not the latter, which makes it unsuitable for the approach described here). So here's how to use these awks for the task:
$ gawk -v RS='foobar[0-9]+' 'RT{print RT}' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12
Note that using RS/RT this way allows matching patterns that contain newlines, something that's not easily achieved with other tools (except Perl, see below).
These methods are easy and quick; however, if none of the above implementations is available, we need to use something more standard.
Standard awk
With standard awk, a way to extract all occurrences is to use a loop over each line, repeatedly using match():
$ cat matches.awk
{
  line = $0
  while (match(line, /foobar[0-9]+/) > 0) {
    print substr(line, RSTART, RLENGTH)
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches.awk test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12
Here the original line is saved (in case it's needed for further processing) and a copy is used to find matches. Since match() only finds the first match in the string, when a match is found it's removed so running match() again can find the following occurrence (if any). For this reason, the above code will loop forever if it's given a pattern that can match the empty string, like for example a*. When you do that, you really want a+ instead anyway, so use the latter. The code above is a common awk idiom to find all matches of a pattern.
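If the pattern cannot be trusted to match at least one character, the loop can be guarded. The following is only a minimal sketch, not part of the recipe above: the all_matches function name and the re variable are made up for illustration, and on an empty match it simply skips one character so that the remaining text always shrinks.

```shell
# Hypothetical helper: print every non-empty match of the ERE given as $1,
# reading lines from standard input.
all_matches() {
    awk -v re="$1" '{
        line = $0
        while (line != "" && match(line, re) > 0) {
            if (RLENGTH == 0) {
                # Empty match: drop one character so the loop always advances.
                line = substr(line, RSTART + 1)
                continue
            }
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

# The pattern a* can match the empty string, yet the loop terminates.
printf 'baaab\n' | all_matches 'a*'
```

With the input baaab and the pattern a*, this prints aaa once and then stops.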
Sed
With sed the task is a bit complicated. Basically, we need to somehow "mark" the parts of the data that match our pattern, so we can later delete everything that's not between markers, thus leaving only what we want.
A safe character to use as marker is the newline character (\n), since sed guarantees that, under normal conditions, no input line as seen in the pattern space will contain that character. For the first of the following solutions to work, a sed implementation that recognizes \n in the RHS and the special bracket expression [^\n] (any character except \n) is needed. And since our pattern is an ERE (though it could be rewritten as a BRE), we need a sed that recognizes EREs. GNU sed has all these features, and we're going to assume it in the examples.
That said, let's see a couple of ways to solve the task with sed.
One somewhat laborious solution is as follows:
$ sed -E '
s/foobar[0-9]+/\n&/g
t ok
d
:ok
s/^[^\n]*\n//
s/(foobar[0-9]+)[^\n]*/\1/g' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12
Here we prepend a \n to each match, then delete what's before the very first match in the line (zero or more non-\n followed by a \n at the beginning of the string). Finally we delete all the parts between matches, which leaves us with just the matches, nicely separated by \n characters.
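To see what the first substitution leaves in the pattern space, the l command can be used to print it in unambiguous form. This is just an illustrative sketch, assuming GNU sed (for -E and for \n in the replacement); show_markers is a made-up name.

```shell
# Hypothetical helper: run only the marking step and dump the pattern space.
# "l" shows embedded newlines as \n and ends each line with $.
show_markers() {
    sed -E -n 's/foobar[0-9]+/\n&/g; l'
}

printf 'ooooooo#foobar12#ggggggg#foobar17\n' | show_markers
# prints: ooooooo#\nfoobar12#ggggggg#\nfoobar17$
```

Each match is now preceded by a marker, which is what the two later substitutions rely on.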
Another approach to the problem is implemented with the following code (which also has the benefit of using standard syntax; changing the ERE into BRE (foobar[0-9][0-9]*) and converting all the "\n" in the RHS to literal escaped newlines would allow this solution to be used with a standard sed):
$ sed -E '
/\n/!s/foobar[0-9]+/\n&\n/g
/^foobar[0-9]+\n/P
D' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12
Here the approach is to "isolate" each match with a \n before and one after (if the pattern space doesn't already have one). If the line begins with a match, it's printed with "P" (up to the following \n, which is what we want). Regardless, the part up to and including the first \n is deleted (with "D"). If something is left, go to the beginning to do the previous steps again, until the whole pattern space is entirely consumed. If there were no matches in the original line, "D" will just delete it entirely and start a new cycle. Rinse and repeat for every input line.
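For reference, here is a sketch of the standard-syntax variant mentioned earlier: the ERE is rewritten as a BRE and each \n in the RHS becomes a backslash followed by a literal newline. The extract_bre wrapper is only there for illustration and reads from standard input.

```shell
# Hypothetical wrapper around the BRE version of the P/D loop.
# The substitution spans three physical lines because each newline in the
# replacement is written as a backslash followed by a real newline.
extract_bre() {
    sed '
/\n/!s/foobar[0-9][0-9]*/\
&\
/g
/^foobar[0-9][0-9]*\n/P
D'
}

printf 'ooooooo#foobar12#ggggggg#foobar17\nsomefoobar12thatwedontwant\n' | extract_bre
```

On these two sample lines it prints foobar12, foobar17 and foobar12, one per line, like the ERE version does.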
Perl
With perl we can do it pretty easily thanks to its powerful regular expression matching operators:
$ perl -ne 'print "$_\n" for (/foobar\d+/g);' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12
If the pattern we want has newlines in it, we can just tell perl to slurp the whole file with perl -0777 -ne and we're set (-000, by contrast, selects paragraph mode, where records are separated by blank lines).
Context comes to town
All the solutions seen so far strictly match a pattern, regardless of where it appears. In other words, they ignore the context of the matches. However there may be cases where this is important. In our example input data, we might want to match foobar[0-9]+ only if it's delimited, where "delimited" here is defined as "preceded by either a hash (#) or beginning of line, and followed by either a hash or end of line". Clearly, with these new requirements we don't want the foobar12 in the last line.
We thus need to consider the context in the regular expressions, making them include a larger text, so that matches only happen where there's data that we want; however, since the matched text will now be larger than what we need, we need to subsequently "clean up" the match, extracting only what we want from it. Our regular expression becomes now (ERE syntax)
(^|#)foobar[0-9]+(#|$)
Let's see how to modify the previous solutions to work with context.
GNU grep
Grep can't really edit text, so it would seem like it's out of the discussion here, but with a silly trick we can still use it:
$ grep -Eo '(^|#)foobar[0-9]+(#|$)' test.txt | grep -Eo 'foobar[0-9]+'
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
The first grep prints all matches with their context, and the second one, operating only on the good data, strictly "extracts" the matches that we need.
GNU awk and BusyBox awk
Setting RS to a non-default value obviously causes awk to stop working in line-oriented mode, so the beginning-of-line and end-of-line anchors in our regular expression need to be augmented to consider the newline character.
Now, with the extended RS, RT will contain the full match with context, so we use gsub() to clean it up:
$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
The critical part here is obviously the gsub(), which should be written carefully to remove the context stuff and only leave what we want.
Standard awk
Here we don't change RS so we're using the traditional line-oriented mode:
$ cat matches2.awk
{
  line = $0
  while (match(line, /(^|#)foobar[0-9]+(#|$)/) > 0) {
    m = substr(line, RSTART, RLENGTH)
    gsub(/^#|#$/, "", m)
    print m
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches2.awk test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
Sed
Things start to get complicated with sed if we want context. However we can still do it.
Of the two sed solutions presented previously, the easiest to adapt is the second one, so here it is:
$ sed -E '
/\n/!s/(^|#)foobar[0-9]+(#|$)/\n&\n/g
/^#?foobar[0-9]+#?\n/ {
  s/^#?(foobar[0-9]+)#?/\1/
  P
}
D' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
Again, the critical bit is the part where the context (that we needed to match only the "correct" parts, but no longer want) is removed. This part will be highly dependent on the actual input data and problem requirements.
Perl
Perl is again an easy winner, as we can match with context and pull out only the interesting parts in a single go:
$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?:#|$)/g);' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
The regular expressions for what comes before and after are non-capturing, so the list returned by the overall match is already made of clean strings, which we thus just need to print.
Overlap problems
You might have noticed that at the same time we introduced context to the matches, we also introduced the potential for overlap. Consider the following sample input data (saved as test2.txt for the commands below):
12345#foobar3#foobar9999#blah
somefoobar12thatwedontwant
If we run for example the above GNU awk solution on this data, we get:
$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test2.txt
foobar3
The foobar9999 is missed since the regular expression that matches foobar3 also "consumes" its surrounding context (the leading and trailing hash) and thus applying the regex with context again on what's left fails to match the second occurrence of the pattern.
However, this does not happen with all the solutions; only with some of them. The standard awk and the sed solutions still work since the previous match is deleted from the line, and the extended pattern we use to include context also works when the match is at the beginning of a line without a delimiter. In the example, once #foobar3# has been matched and removed, what's left is "^foobar9999#blah$", and the expression we're using can still match it, since the pattern is at the very beginning and ^ is a possible anchor.
Of course, this happens to work because of the specific combination of input data and regular expressions that we're using; generally speaking, this doesn't have to be the case. It will depend on the actual situation.
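To illustrate how fragile this is, here is a sketch with a deliberately stricter, hypothetical requirement: the match must have a hash on both sides, with no ^ or $ alternative. Removing the whole match would then lose foobar9999, so this variant keeps the trailing delimiter in the remaining text. It assumes single-character delimiters, and the function name is made up.

```shell
# Hypothetical variant of the standard awk loop for the pattern
# #foobar[0-9]+# (hash required on both sides).
extract_hash_delimited() {
    awk '{
        line = $0
        while (match(line, /#foobar[0-9]+#/) > 0) {
            # Strip the one-character delimiter on each side.
            print substr(line, RSTART + 1, RLENGTH - 2)
            # Resume AT the trailing "#" so it can open the next match.
            line = substr(line, RSTART + RLENGTH - 1)
        }
    }'
}

printf '12345#foobar3#foobar9999#blah\n' | extract_hash_delimited
```

This prints foobar3 and foobar9999. Resuming one character later, as the original loop does, would print only foobar3 with this pattern.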
The modern RE engine answer to safely solve the overlapping context problem is, naturally, lookaround, which turns actual consumed characters into zero-length assertions, and leaves them available for the next match attempt. This means that sed and awk are excluded, since their RE engines do not support lookaround.
What's left is GNU grep (with its -P option to match in PCRE mode, where available), and of course perl.
grep:
$ grep -Po '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt
foobar3
foobar9999
There's also a pcregrep utility that comes with the PCRE library, with a syntax similar to that of grep. In particular, it supports the -o option, so we can also do:
$ pcregrep -o '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt
foobar3
foobar9999
Let's try perl:
$ perl -ne 'print "$_\n" for (/(?<=^|#)(foobar\d+)(?=#|$)/g);' test2.txt
Variable length lookbehind not implemented in regex m/(?<=^|#)(foobar\d+)(?=#|$)/ at -e line 1.
Oops... it seems PCRE is more advanced than perl itself in this particular feature (at least for the perl used here; releases from 5.30 onward accept lookbehinds of limited variable length). As man pcrepattern informs us,
The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length. Thus
(?<=bullock|donkey)
is permitted, but
(?<!dogs?|cats?)
causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl, which requires all branches to match the same length of string. An assertion such as
(?<=ab(c|de))
is not permitted, because its single top-level branch can match two different lengths, but it is acceptable to PCRE if rewritten to use two top-level branches:
(?<=abc|abde)
So what can we do with perl? We have two possibilities.
We note that, strictly speaking, and in this particular case, only what follows the match has to be preserved for the next attempt; the lookbehind is not strictly needed, and we can replace it with a regular match. Thus:
$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?=#|$)/g);' test2.txt
foobar3
foobar9999
Another way to solve the problem is a bit ugly, but it works: we can just move the ^ anchor outside the lookbehind and make it part of a regular alternation; since it's a zero-length match anyway, nothing is harmed:
$ perl -ne 'print "$_\n" for (/(?:^|(?<=#))(foobar\d+)(?=#|$)/g);' test2.txt
foobar3
foobar9999
It is important to understand that there's no generic rule here, and the solution will necessarily have to depend on the problem at hand. Depending on the actual situation, transforming a variable-length lookbehind into something accepted by perl may not always be so easy (or even possible).