Replace every Nth occurrence

Posted by waldner on 22 August 2012, 2:20 pm

Problem statement: replace every Nth occurrence of a pattern. For bonus points, provide context to the match.

For this article, we're going to use this example (mostly nonsensical, but it illustrates the concept):

1foo1 bar foo 2foo3 abc def foo foo foo 1foo4 foo foo 3foo9 4foo7 zzz 7foo7 3foo3

And we want, among all matches of /[0-9]foo[0-9]/, to replace the "foo" with "FOO" on every third match (read it again if it's not clear). So the output we want is:

1foo1 bar foo 2foo3 abc def foo foo foo 1FOO4 foo foo 3foo9 4foo7 zzz 7FOO7 3foo3

That is, the third and sixth matches get their "foo" part replaced with "FOO".
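
To make the match numbering concrete, here is a small sketch that lists the matches found in the sample line together with their ordinal. It assumes a grep that supports -o (GNU and BSD grep do, but POSIX does not require it); the names input and list_matches are made up for this illustration.

```shell
# Sample line from the article
input='1foo1 bar foo 2foo3 abc def foo foo foo 1foo4 foo foo 3foo9 4foo7 zzz 7foo7 3foo3'

# Print each match of [0-9]foo[0-9] on its own line, prefixed with its ordinal.
# grep -o is an extension (GNU/BSD), not POSIX.
list_matches() {
  printf '%s\n' "$1" | grep -o '[0-9]foo[0-9]' | awk '{ print NR ": " $0 }'
}

list_matches "$input"
```

Entries 3 and 6 of that listing are 1foo4 and 7foo7, which are the two matches that should end up modified.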

Perl

Perl is the easy winner here due to its ability to evaluate code directly in the replacement part:

$count = 0; s/(\d)foo(\d)/(++$count % 3 == 0)?"$1FOO$2":$&/ge;

So in the RHS, if the number of the match we're seeing is a multiple of 3, we replace foo with FOO; otherwise we replace the match with itself. Obviously, by changing the test on $count we can control exactly which matches we act upon. Depending on the exact need, we could also employ the replacement operator directly on the matched part:

$count = 0; s/(\d)foo(\d)/$a=$&; $a =~ s{foo}{FOO} if (++$count % 3 == 0); $a/ge;

If no context is needed, things can be simplified (e.g. no capturing and replaying of the digits before and after foo):

$count = 0; s/foo/(++$count % 3 == 0)?"FOO":$&/ge;

But of course, due to the lack of context, this does a different thing on the input:

1foo1 bar foo 2FOO3 abc def foo foo FOO 1foo4 foo FOO 3foo9 4foo7 zzz 7FOO7 3foo3

So (as usual when working with regular expressions) one should know the data, and what kind of processing is needed, exactly before choosing which solution to use.

Awk

While not as straightforward as Perl, awk can indeed be used successfully for this kind of task. It just takes a bit more code:

{
  count = 0
  newline = ""
 
  while(match($0, /[0-9]foo[0-9]/) > 0) {
    count++
    newline = newline substr($0, 1, RSTART - 1)
    matched = substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
 
    if (count % 3 == 0) {
 
      # simple sub(), but see text below
      sub(/foo/, "FOO", matched)
    }
 
    newline = newline matched
  }
  newline = newline $0
  print newline
}

Here we're doing a simple sub() on the matched part, but depending on the exact task we may need to extract/save and restore the context, possibly running match() again on the matched part, for example:

{
  count = 0
  newline = ""
 
  while(match($0, /[0-9]foo[0-9]/) > 0) {
    count++
    newline = newline substr($0, 1, RSTART - 1)
    matched = substr($0, RSTART, RLENGTH)
 
    $0 = substr($0, RSTART + RLENGTH)
 
    if (count % 3 == 0) {
      match(matched, /^[0-9]/)
      prematch = substr(matched, RSTART, RLENGTH)
      matched = substr(matched, RLENGTH + 1)
      match(matched, /[0-9]$/)
      postmatch = substr(matched, RSTART, RLENGTH)
      matched = substr(matched, 1, RSTART - 1)
      sub(/foo/, "FOO", matched)
 
      # restore context
      matched = prematch matched postmatch
    }
 
    newline = newline matched
  }
  newline = newline $0
  print newline
}

Here the context is just a single leading/trailing digit so it could have been done by just taking the first and last character of matched, but hopefully the above illustrates the general concept of finding, extracting and restoring the parts that make up the context, which can be quite lengthy with awk.
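
For comparison, this is roughly what the shortcut mentioned above could look like: the context is taken as the first and last character of the match instead of being located with match(). It is only a sketch, wrapped in a shell function so it can be run directly; every_third_awk, input and middle are names chosen for the illustration.

```shell
input='1foo1 bar foo 2foo3 abc def foo foo foo 1foo4 foo foo 3foo9 4foo7 zzz 7foo7 3foo3'

every_third_awk() {
  awk '{
    count = 0
    newline = ""

    while (match($0, /[0-9]foo[0-9]/) > 0) {
      count++
      newline = newline substr($0, 1, RSTART - 1)
      matched = substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + RLENGTH)

      if (count % 3 == 0) {
        # context: first and last character of the match
        prematch  = substr(matched, 1, 1)
        postmatch = substr(matched, length(matched), 1)
        middle    = substr(matched, 2, length(matched) - 2)
        sub(/foo/, "FOO", middle)

        # restore context
        matched = prematch middle postmatch
      }

      newline = newline matched
    }
    print newline $0
  }'
}

printf '%s\n' "$input" | every_third_awk
```

This only works because the context has a fixed length of one character on each side; with variable-length context the match()-based extraction is still needed.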

Again, if we don't need the context, most of the hassle just goes away:

{
  count = 0
  newline = ""
 
  while(match($0, /foo/) > 0) {
    count++
    newline = newline substr($0, 1, RSTART - 1)
    matched = substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
 
    if (count % 3 == 0) {
      sub(/foo/, "FOO", matched)
      # or here even just
      # matched = "FOO"
    }
 
    newline = newline matched
  }
  newline = newline $0
  print newline
}

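NOTE: be careful not to look for matches that can be zero-length with awk, because they can cause an endless loop: a zero-length match at the start of the line leaves $0 unchanged, so the while condition stays true forever.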
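
One possible guard against that endless loop is to check RLENGTH and, on an empty match, copy one character to the output and retry. The sketch below does that in a parameterized variant of the context-free program; replace_nth and its three arguments (N, the regex, the replacement text) are hypothetical, and empty matches are simply skipped rather than counted, which is a design choice made for this example.

```shell
# Hypothetical helper: replace every n-th non-empty match of regex re with repl.
# Note: awk processes backslash escapes in values assigned with -v.
replace_nth() {
  awk -v n="$1" -v re="$2" -v repl="$3" '{
    count = 0
    newline = ""

    while ($0 != "" && match($0, re) > 0) {
      if (RLENGTH == 0) {
        # empty match: copy the text up to and including one more character, then retry
        newline = newline substr($0, 1, RSTART)
        $0 = substr($0, RSTART + 1)
        continue
      }

      count++
      newline = newline substr($0, 1, RSTART - 1)
      matched = substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + RLENGTH)

      if (count % n == 0)
        matched = repl

      newline = newline matched
    }
    print newline $0
  }'
}

# x* can match the empty string; without the guard this would never finish
printf '%s\n' 'abxxcxd' | replace_nth 2 'x*' 'Y'
```

Here the non-empty matches are "xx" and "x", so with N == 2 only the second one is replaced.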

Sed

Sed is definitely NOT recommended for this task. As a divertissement, here's a sed solution using markers for the much simpler case of N == 2 (every other match), without bothering with context (needs GNU sed):

sed 's/foo/\x1&\x2/g; s/\([^\x1]*[\x1][^\x2]*[\x2][^\x1]*\)[\x1][^\x2]*[\x2]/\1FOO/g; s/[\x1\x2]//g'
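
To see what the markers do, here is the same command applied to the sample line. The first substitution wraps every foo between the control characters \x1 and \x2; the second repeatedly matches a stretch containing one marked foo followed by a second marked foo, and replaces the second with FOO; the third removes the markers that are left. The wrapper every_other_sed and the variable input are names made up for this sketch, and it assumes GNU sed and input that does not already contain those two control characters.

```shell
input='1foo1 bar foo 2foo3 abc def foo foo foo 1foo4 foo foo 3foo9 4foo7 zzz 7foo7 3foo3'

# Needs GNU sed for the \x1 and \x2 escapes.
every_other_sed() {
  sed 's/foo/\x1&\x2/g; s/\([^\x1]*[\x1][^\x2]*[\x2][^\x1]*\)[\x1][^\x2]*[\x2]/\1FOO/g; s/[\x1\x2]//g'
}

printf '%s\n' "$input" | every_other_sed
```

Since no context is considered, bare "foo" words take part in the count just like the digit-surrounded ones.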

2 Comments

aseel says:

January 16, 2015 at 16:14

Hi I'm new to using linux and I want to implement this tutorial to replace every 4th entry of a text. where do we put in the input file name and how do we get it to print the results in a different output file.

Thanks
- waldner says:
  
  January 18, 2015 at 18:35
  
  Using the perl example, assuming you want to replace every fourth "foo" with "bar":
```
perl -pe '$count = 0; s/foo/(++$count % 4 == 0)?"bar":$&/ge;' inputfile.txt > outputfile.txt
```
