Text replacement in context/out of context

Posted by waldner on 4 August 2012, 10:52 am

OK, the title isn't the best one, but essentially the problem is this: I want to replace FOO with BAR, but only if FOO is (or is not) part of a text in brackets. Brackets are just an example, although it seems to be a commonly occurring case; the point is that the match must (or must not) be in a certain context. So, given this input:

abcd FOO efgh [this FOO is in brackets] ijkl FOO [another FOO in brackets]

The output should be either

abcd FOO efgh [this BAR is in brackets] ijkl FOO [another BAR in brackets]

or

abcd BAR efgh [this FOO is in brackets] ijkl BAR [another FOO in brackets]

depending on whether we want the in-context or out-of-context replacement.

This is an interesting problem. There are a few different ways to approach it.

In-context replacement

In-context replacement is probably easier, so let's start with it.

awk

The idea with awk is to use match() repeatedly to find all the instances of the context, and perform the replacements only on them. In our example, the contexts are all the bracketed blocks, so:

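{
  newline = ""
  # consume the line one bracketed block (context) at a time
  while(match($0, /\[[^]]*\]/) > 0) {
    # text before the context is copied unchanged
    newline = newline substr($0, 1, RSTART - 1)
    # the context itself gets the replacement
    context = substr($0, RSTART, RLENGTH)
    gsub(/FOO/, "BAR", context)
    newline = newline context
    # continue with whatever follows the context
    $0 = substr($0, RSTART + RLENGTH)
  }
  # what is left contains no further context
  newline = newline $0
  print newline
}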

The variable newline holds the changed line, which is built up gradually. Parts of the original line that lie outside a context are appended to newline unchanged, while each context is appended after its FOOs have been replaced with BAR. At the end, newline is printed.

Note that for simplicity we're NOT considering nested contexts (in the example, bracketed blocks containing bracketed subblocks), which rapidly become very hard to parse using regular expressions. We're also deliberately ignoring the case where the context-delimiting characters can appear (perhaps escaped somehow) in some other place and thus should not be treated as delimiters.

Let's build a simple test file (which will be used throughout the examples) and test it:

$ cat sample.txt
abcd FOO efgh [this FOO is in brackets] ijkl FOO nmop [another FOO in brackets] blah
abcd FOO efgh
[this FOO is in brackets]
ijkl FOO mnop [another FOO in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [FOO FOO][FOO FOO]FOO FOO
[FOO]
[FOO]ijkl
$ awk -f incontext.awk sample.txt
abcd FOO efgh [this BAR is in brackets] ijkl FOO nmop [another BAR in brackets] blah
abcd FOO efgh
[this BAR is in brackets]
ijkl FOO mnop [another BAR in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [BAR BAR][BAR BAR]FOO FOO
[BAR]
[BAR]ijkl
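
To make the nested-context caveat concrete, here is a small sketch that wraps the same awk program in a shell function (the name incontext is just something made up for this example) and feeds it a line with nested brackets. Since the regular expression stops at the first "]", the context ends too early:

```shell
# Same in-context awk program as above, wrapped in a hypothetical shell function.
incontext() {
  awk '{
    newline = ""
    while (match($0, /\[[^]]*\]/) > 0) {
      newline = newline substr($0, 1, RSTART - 1)
      context = substr($0, RSTART, RLENGTH)
      gsub(/FOO/, "BAR", context)
      newline = newline context
      $0 = substr($0, RSTART + RLENGTH)
    }
    print newline $0
  }' "$@"
}

# The match ends at the first "]", i.e. right after "inner FOO]",
# so the last FOO is treated as out-of-context text and left alone.
echo '[outer FOO [inner FOO] FOO]' | incontext
# prints: [outer BAR [inner BAR] FOO]
```

So nested input is not rejected, it is just silently handled wrong, which is worth keeping in mind before running any of these on real data.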

sed

With sed, we use a loop and keep replacing FOOs that appear in a context:

$ sed ':loop; s/\(\[[^]]*\)FOO\([^]]*\]\)/\1BAR\2/; t loop' sample.txt
abcd FOO efgh [this BAR is in brackets] ijkl FOO nmop [another BAR in brackets] blah
abcd FOO efgh
[this BAR is in brackets]
ijkl FOO mnop [another BAR in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [BAR BAR][BAR BAR]FOO FOO
[BAR]
[BAR]ijkl

Note that this will not work if the replacement string contains the text being matched (i.e., FOO here): each pass would find a new FOO inside the context, leading to an endless loop.
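
One possible way around the endless loop, sketched here under the assumption that the input never contains an ASCII 1 byte, is to have the loop insert a placeholder instead of the final text, and turn the placeholders into the real replacement only after the loop has finished. The function name incontext_foobar and the replacement FOOBAR are made up for the example; the placeholder is kept in a shell variable so the script doesn't depend on sed understanding \x1.

```shell
# Hypothetical helper: replace FOO with FOOBAR inside brackets only.
incontext_foobar() {
  m=$(printf '\001')   # placeholder byte, assumed to be absent from the input
  sed "
:loop
s/\(\[[^]]*\)FOO\([^]]*\]\)/\1${m}\2/
t loop
s/${m}/FOOBAR/g
" "$@"
}

echo 'abcd FOO [this FOO is FOO] FOO' | incontext_foobar
# prints: abcd FOO [this FOOBAR is FOOBAR] FOO
```

The loop terminates because the placeholder never matches FOO, and the final s///g runs only once per line.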

Perl

Perl is the most powerful of the bunch, so we can do the replacement directly on each matched context with the help of the /e switch (for eval) to the replacement:

$ perl -pe 's/\[.*?\]/($a=$&)=~s%FOO%BAR%g;$a/eg' sample.txt
abcd FOO efgh [this BAR is in brackets] ijkl FOO nmop [another BAR in brackets] blah
abcd FOO efgh
[this BAR is in brackets]
ijkl FOO mnop [another BAR in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [BAR BAR][BAR BAR]FOO FOO
[BAR]
[BAR]ijkl

Also note that Perl's regular expressions are able to match contexts that would be difficult or impossible to match with standard awk/sed REs (think non-greedy quantifiers or lookaround). The example uses a simple context (brackets) so it's possible to use all the tools.

Out of context

This is a bit harder to accomplish, and in some cases we must resort to dirty tricks.

awk

Looking closely at the awk in-context solution, we notice that the loop sees both the contexts and the out-of-context data, alternately. So all we need to do is perform the replacements on the out-of-context data instead of the contexts, and the solution is almost the same as the one for in-context replacement:

{
  newline = ""
  while(match($0, /\[[^]]*\]/) > 0) {
    # text before the context gets the replacement
    outofcontext = substr($0, 1, RSTART - 1)
    gsub(/FOO/, "BAR", outofcontext)
    newline = newline outofcontext
    # the context itself is copied unchanged
    context = substr($0, RSTART, RLENGTH)
    newline = newline context
    $0 = substr($0, RSTART + RLENGTH)
  }
  # the tail after the last context is out-of-context data too
  gsub(/FOO/, "BAR")
  newline = newline $0
  print newline
}

$ awk -f outofcontext.awk sample.txt
abcd BAR efgh [this FOO is in brackets] ijkl BAR nmop [another FOO in brackets] blah
abcd BAR efgh
[this FOO is in brackets]
ijkl BAR mnop [another FOO in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
BAR BAR [FOO FOO][FOO FOO]BAR BAR
[FOO]
[FOO]ijkl
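
Since the two awk programs differ only in which piece gets the gsub(), they can be folded into one. The following is just a sketch: the function name replace_by_context and the mode values "in" and "out" are invented for this example, and FOO/BAR are still hardcoded.

```shell
# Hypothetical wrapper: mode "in" replaces inside brackets,
# any other mode replaces outside them.
replace_by_context() {
  mode=$1
  shift
  awk -v mode="$mode" '{
    newline = ""
    while (match($0, /\[[^]]*\]/) > 0) {
      outside = substr($0, 1, RSTART - 1)
      inside = substr($0, RSTART, RLENGTH)
      if (mode == "in") {
        gsub(/FOO/, "BAR", inside)
      } else {
        gsub(/FOO/, "BAR", outside)
      }
      newline = newline outside inside
      $0 = substr($0, RSTART + RLENGTH)
    }
    # the tail after the last context is always out-of-context data
    if (mode != "in") {
      gsub(/FOO/, "BAR")
    }
    print newline $0
  }' "$@"
}

echo 'FOO [FOO] FOO' | replace_by_context in
# prints: FOO [BAR] FOO
echo 'FOO [FOO] FOO' | replace_by_context out
# prints: BAR [FOO] BAR
```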

sed

The idea here is that contexts are removed from the line and stored away, the replacement is done on what's left (which thus must be the out-of-context data), and finally the contexts are restored to their original positions. Of course, to "remember" where the removed contexts were, we need some sort of placeholder character. So we put the contexts in the hold space, each followed by an ASCII 1 character, and we use an ASCII 1 in the original line to mark each spot where a context has to be reinserted after the replacements.

h                   # save line to hold space

# remove non-contexts (ie, leave only contexts separated by \x1)
s/^[^[]*\[/[/
s/\][^]]*$/]\x1/
s/\][^[]*\[/]\x1[/g

# swap hold/pattern space to get the original line in pattern space
x

# remove contexts (ie, leave only non-contexts separated by \x1)
s/\[[^]]*\]/\x1/g

# do the actual replacement
s/FOO/BAR/g

# append hold space to pattern space, this gives <patternspace>\n<holdspace> in pattern space
G

# reinsert contexts where they belong
:loop
s/\x1\(.*\)\n\([^\x1]*\)\x1/\2\1\n/
t loop

# remove leftover stuff
s/\n.*//

Not the most straightforward way, but in these cases sed is a bit limited. I probably wouldn't recommend using sed for this task.
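
For completeness, here is a sketch of the same script wrapped in a shell function (outofcontext_sed is a made-up name). It differs from the script above in two ways, both meant to reduce the reliance on GNU extensions: the ASCII 1 byte comes from a shell variable rather than from \x1, and the newline is captured inside the first group so that the replacement part doesn't need \n. I have not tried it on every sed out there, so take the portability as an assumption.

```shell
# Hypothetical helper: out-of-context FOO -> BAR, same steps as the script above.
outofcontext_sed() {
  m=$(printf '\001')   # placeholder byte, assumed to be absent from the input
  sed "
h
s/^[^[]*\[/[/
s/\][^]]*\$/]${m}/
s/\][^[]*\[/]${m}[/g
x
s/\[[^]]*\]/${m}/g
s/FOO/BAR/g
G
:loop
s/${m}\(.*\n\)\([^${m}]*\)${m}/\2\1/
t loop
s/\n.*//
" "$@"
}

echo 'FOO FOO [FOO FOO][FOO FOO]FOO FOO' | outofcontext_sed
# prints: BAR BAR [FOO FOO][FOO FOO]BAR BAR
```

It can be run as outofcontext_sed sample.txt as well, since any arguments are passed on to sed.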

With a sed that supports EREs like GNU sed (which is probably needed anyway to use \x1 as in the other solution above), there is also the option of using a loop, similar to the in-context solution:

sed -r ':loop; s/((^|\])[^[]*)FOO([^[]*($|\[))/\1BAR\3/; t loop' sample.txt

This has the same problem as the in-context solution (the replacement can't contain the pattern), and also leads us directly to the Perl solution.

Perl

With Perl, again, it's quite easy:

$ perl -pe 's/(?:^|\]).*?(?:$|\[)/($a=$&)=~s%FOO%BAR%g;$a/eg' sample.txt
abcd BAR efgh [this FOO is in brackets] ijkl BAR nmop [another FOO in brackets] blah
abcd BAR efgh
[this FOO is in brackets]
ijkl BAR mnop [another FOO in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
BAR BAR [FOO FOO][FOO FOO]BAR BAR
[FOO]
[FOO]ijkl

Essentially the idea is the same as before, but this time we are matching all the out-of-context parts (that is, from either beginning of line or "]" to either end of line or "[").
