Three text processing tasks

Posted by waldner on 14 September 2014, 9:47 am

Just three problems that came up in different circumstances in the last couple of months.

Ranges, again

Ranges strike again. This time the task is to print or select everything from the first occurrence of /START/ in the input to the last occurrence of /END/, either including or excluding the two boundary lines. So, given this sample input:

 1 xxxx
 2 xxxx
 3 END
 4 aaa
 5 START
 6 START
 7 zzz
 8 START
 9 hhh
10 END
11 ppp
12 END
13 mmm
14 START

we want to match from line 5 to 12 (or from line 6 to 11 in the noninclusive version).

The logic is something along the lines of: when /START/ is seen, start collecting lines. Each time an /END/ is seen (and /START/ was previously seen), print what we have so far, empty the buffer and start collecting lines again, in case we see another /END/ later.

Here's an awk solution for the inclusive case:

awk '!ok && /START/ { ok = 1 }
ok { p = p sep $0; sep = RS }
ok && /END/ { print p; p = sep = "" }' file.txt

and here's the noninclusive case, which is mostly the same code with the order of the blocks reversed:

awk 'ok && /END/ { if (content) print p; p = sep = "" }
ok { p = p sep $0; sep = RS; content = 1 }
!ok && /START/ { ok = 1 }' file.txt

The "content" variable is necessary for the obscure corner case in which the input contains something like

...
START

END
...

If we relied upon "p" not being empty to decide whether to print or not, this case would be indistinguishable from this other one:

...
START
END
...

We could also (perhaps a bit cryptically) avoid the extra variable and rely on "sep" being set instead. We keep the extra variable for the sake of clarity.
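
For reference, here is a rough sketch of that variant. It relies on "sep" being non-empty exactly when at least one line has been collected since the last print. The wrapper name range_noninclusive is made up for this example, and the trick assumes RS is not the empty string (that is, awk is not in paragraph mode).

```shell
# Noninclusive range without the "content" variable.
# range_noninclusive is a hypothetical wrapper name, not part of the original code.
range_noninclusive() {
  awk 'ok && /END/ { if (sep != "") print p; p = sep = "" }
ok { p = p sep $0; sep = RS }
!ok && /START/ { ok = 1 }' "$@"
}
```

It is called like the original, for example range_noninclusive file.txt, or at the end of a pipeline.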

Here are two sed solutions implementing the same logic (not really recommended, but the original request was to solve this with sed). The hold buffer is used to accumulate lines.
Inclusive:

# sed -n
# from first /START/ to last /END/, inclusive version

/START/ {
  H
  :loop
  $! {
    n
    H
    # if we see an /END/, sanitize and print
    /END/ {
      x
      s/^\n//
      p
      s/.*//
      x
    }
    bloop
  }
}

The noninclusive version uses the same logic, with two differences. The first /START/ line that we see is discarded (the "n" in the loop reads past it without saving it). And when we see an /END/, we print what we have accumulated so far before appending the /END/ line, so the output crucially does not include that /END/ line; it is, however, kept for the next round of accumulation.

# sed -n
# from first /START/ to last /END/, noninclusive version

/START/ {
  :loop
  $! {
    n
    /END/ {
      # recover lines accumulated so far
      x

      # if there is something, print
      /./ {
        # remove leading \n added by H
        s/^\n//
        p
      }

      # empty the buffer
      s/.*//

      # recover the /END/ line for next round
      x
    }
    H
    bloop
  }
}

Note that the above solutions assume that no line matches both /START/ and /END/. Other solutions are of course possible.
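
One such alternative, sketched here only as an illustration, reads the file twice: the first pass records the line number of the first /START/ and of the last /END/, and the second pass prints the lines in between. The name range_twopass is made up, and the approach assumes the input is a regular file (it cannot work on a pipe). In exchange, nothing has to be buffered.

```shell
# Two-pass alternative; range_twopass is a hypothetical name.
# $1 must be a regular file, because awk reads it twice.
range_twopass() {
  awk 'NR == FNR {                      # first pass: only record line numbers
         if (!s && /START/) s = FNR     # first START
         if (/END/) e = FNR             # last END seen so far
         next
       }
       s && FNR >= s && FNR <= e        # second pass: print the range
      ' "$1" "$1"
}
```

For the noninclusive version, the last condition would become FNR > s && FNR < e. If the last /END/ comes before the first /START/, nothing is printed.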

Conditional line join

In this case we have some special lines (identified by a pattern). Every time a special line is seen, all the previous or following lines should be joined to it. An example to make it clear, using /SPECIAL/ as our pattern:

SPECIAL 1
line2
line3
SPECIAL 2
line5
line6
line7
SPECIAL 3
SPECIAL 4
line10
SPECIAL 5

So we want one of the two following outputs, depending on whether we join the special lines to the preceding or the following ones:

# join with following lines
SPECIAL 1 line2 line3
SPECIAL 2 line5 line6 line7
SPECIAL 3
SPECIAL 4 line10
SPECIAL 5

# join with preceding lines
SPECIAL 1
line2 line3 SPECIAL 2
line5 line6 line7 SPECIAL 3
SPECIAL 4
line10 SPECIAL 5

The sample input has been artificially crafted to work with both types of change; in practice, in real inputs either the first or the last line won't match /SPECIAL/, depending on the needed processing.

So here's some awk code that joins each special line with the following ones, until a new special line is found, thus producing the first of the two outputs shown above:

awk -v sep=" " '/SPECIAL/ && done == 1 {
  print ""
  s = ""
  done = 0
}
{
  printf "%s%s", s, $0
  s = sep
  done = 1
}
END {
  if (done) print ""
}' file.txt

And here's the idiomatic solution to produce the second output (join with preceding lines):

awk -v sep=" " '{ ORS = /SPECIAL/ ? RS : sep }1' file.txt

The variable "sep" should be set to the desired separator to be used when joining lines (here it's simply a space).
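
As a small illustration of changing the separator, here is the same one-liner wrapped in a function that takes the separator as its argument and reads standard input. The function name join_preceding and the tiny input are made up for the example.

```shell
# join_preceding is a hypothetical wrapper; $1 is the separator, input comes from stdin.
join_preceding() {
  awk -v sep="$1" '{ ORS = /SPECIAL/ ? RS : sep }1'
}

# join with ", " instead of a single space
printf '%s\n' 'line1' 'line2' 'SPECIAL 1' 'line4' 'SPECIAL 2' | join_preceding ', '
# prints:
# line1, line2, SPECIAL 1
# line4, SPECIAL 2
```

Keep in mind that awk processes backslash escapes in values assigned with -v, so for example -v sep='\t' joins the lines with tabs.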

Intra-block sort

(for want of a better name)

Let's imagine an input file like

alpha:9832
alpha:11
alpha:449
delta:23847
delta:113
gamma:1
gamma:10
gamma:100
gamma:101
beta:5768
beta:4

The file has sections, where the first field names the section (alpha, beta etc.). Now we want to sort each section according to its second field (numeric), but without changing the overall order of the sections. In other words, we want this output:

alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768

As a variation, blocks can be separated by a blank line, as follows:

alpha:9832
alpha:11
alpha:449

delta:23847
delta:113

gamma:1
gamma:10
gamma:100
gamma:101

beta:5768
beta:4

So the corresponding output should be

alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768

Shell

The blatantly obvious solution using the shell is to number each section adding a new field at the beginning, then sort according to field 1 + field 3, and finally print the result removing the extra field that we added:

awk -F ':' '$1 != prev {count++} {prev = $1; print count FS $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768

Instead of reusing awk, the job of the last part of the pipeline could have been done for example with cut or sed.
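
A possible way to write those two variants is sketched below. The function names are made up; the first two stages of the pipeline are the same as above, and only the final stage that strips the prepended counter changes.

```shell
# First two stages of the pipeline, unchanged; function names are hypothetical.
number_and_sort() {
  awk -F ':' '$1 != prev {count++} {prev = $1; print count FS $0}' "$@" |
    sort -t ':' -k1,1n -k3,3n
}

# strip the prepended counter with cut: keep everything from field 2 on
strip_with_cut() { cut -d ':' -f 2-; }

# strip it with sed: delete up to and including the first colon
strip_with_sed() { sed 's/^[^:]*://'; }

# usage:
#   number_and_sort file.txt | strip_with_cut
#   number_and_sort file.txt | strip_with_sed
```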

For the variation with separated blocks, an almost identical solution works. Paragraphs are numbered prepending a new field, the result sorted, and the prepended numbers removed before printing:

awk -v count=1 '/^$/{count++}{print count ":" $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768

A crucial property of this solution is that empty lines are always treated as part of the next paragraph (not the previous one), so when sorting they remain where they are. This also means that runs of empty lines in the input are preserved in the output.

Perl

The previous solutions treat the input as a single entity, regardless of how many blocks it has. After preprocessing, sort is applied to the whole data, and if the file is very big, many temporary resources (disk, memory) are needed to do the sorting.

Let's see if it's possible to be a bit more efficient and sort each block independently.

Here is an example with perl that works with both variations of the input (without and with separated blocks).

#!/usr/bin/perl

use warnings;
use strict;

# print the lines of a block, sorted numerically by their second field
sub printblock {
  print $_->[1] for (sort { $a->[0] <=> $b->[0] } @_);
}

my @block = ();
my ($prev, $cur, $val);

while(<>){

  my $empty = /^$/;

  if (!$empty) {
    ($cur, $val) = /^([^:]*):([^:]*)/;
    chomp($val);
  }

  # a blank line or a new section name ends the current block
  if (@block && ($empty || $cur ne $prev)) {
    printblock(@block);
    @block = ();
  }

  if ($empty) {
    print;
  } else {
    push @block, [ $val, $_ ];
    $prev = $cur;
  }
}

printblock(@block) if (@block);
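
For comparison, the same per-block idea can be sketched with awk handing each block to its own sort process, closing the pipe whenever a block ends. This is only a sketch under a few assumptions: the name sort_blocks is made up, the awk in use supports fflush() (gawk, mawk and busybox awk do), and starting one sort per block is acceptable, which may not be the case when there are very many small blocks.

```shell
# sort_blocks is a hypothetical name; input is read from the files given, or stdin.
sort_blocks() {
  awk -F ':' -v cmd="sort -t ':' -k2,2n" '
    # a blank line or a change of section name ends the current block
    (/^$/ || $1 != prev) && running { close(cmd); running = 0 }
    # blank lines are printed as they are; flush so they stay in order
    /^$/ { print; fflush(); next }
    # every other line is handed to the sort process of its block
    { print | cmd; running = 1; prev = $1 }
  ' "$@"
}
```

The close() call waits for sort to finish and write its output, which is what keeps the blocks in their original order.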

Of course all the sample code given here must be adapted to the actual input format.
