Just three problems that came up in different circumstances in the last couple of months.
Ranges, again
Ranges strike again. This time the task is to print or select everything from the first occurrence of /START/ in the input to the last occurrence of /END/, with or without the two boundary lines themselves. So, given this sample input:
1 xxxx
2 xxxx
3 END
4 aaa
5 START
6 START
7 zzz
8 START
9 hhh
10 END
11 ppp
12 END
13 mmm
14 START
we want to match from line 5 to 12 (or from line 6 to 11 in the noninclusive version).
The logic is something along the lines of: when /START/ is seen, start collecting lines. Each time an /END/ is seen (and /START/ was previously seen), print what we have so far, empty the buffer and start collecting lines again, in case we see another /END/ later.
Here's an awk solution for the inclusive case:
awk '!ok && /START/ { ok = 1 } ok { p = p sep $0; sep = RS } ok && /END/ { print p; p = sep = "" }' file.txt
and here's the noninclusive case, which is mostly the same code with the order of the blocks reversed:
awk 'ok && /END/ { if (content) print p; p = sep = "" } ok { p = p sep $0; sep = RS; content = 1 } !ok && /START/ { ok = 1 }' file.txt
The "content" variable is necessary for the obscure corner case in which the input contains something like
...
START
END
...
If we relied upon "p" not being empty to decide whether to print or not, this case (nothing between START and END, so nothing should be printed) would be indistinguishable from this other one, where a single empty line sits between them and should be printed:
...
START

END
...
We could also (perhaps a bit cryptically) avoid the extra variable and rely on "sep" being set instead. We keep the extra variable for the sake of clarity.
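For reference, the variant that relies on "sep" might look like the following sketch. It assumes the default RS (a newline), so "sep" is non-empty exactly when at least one line has been collected since the last reset. The wrapper function name range_noninclusive is made up here purely for illustration.

```shell
# Noninclusive range, testing "sep" instead of a separate "content" flag.
# Assumption: RS is the default newline, so sep is non-empty as soon as
# one line (even an empty one) has been stored in p.
range_noninclusive() {
  awk '
    ok && /END/    { if (sep != "") print p; p = sep = "" }
    ok             { p = p sep $0; sep = RS }
    !ok && /START/ { ok = 1 }
  ' "$@"
}
```

It can be called as range_noninclusive file.txt, or fed from a pipe. The behavior should match the version with "content", at the cost of being harder to read.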
Here are two sed solutions implementing the same logic (not really recommended, but the original request was to solve this with sed). The hold buffer is used to accumulate lines.
Inclusive:
# sed -n
# from first /START/ to last /END/, inclusive version

/START/ {
  H
  :loop
  $! {
    n
    H
    # if we see an /END/, sanitize and print
    /END/ {
      x
      s/^\n//
      p
      s/.*//
      x
    }
    bloop
  }
}
The noninclusive version uses the same logic, except we discard the first /START/ line that we see (done by the "n" in the loop), and, when we see an /END/, we print what we have so far (which crucially does not include the /END/ line itself, which however is included for the next round of accumulation).
# sed -n
# from first /START/ to last /END/, noninclusive version

/START/ {
  :loop
  $! {
    n
    /END/ {
      # recover lines accumulated so far
      x
      # if there is something, print
      /./ {
        # remove leading \n added by H
        s/^\n//
        p
      }
      # empty the buffer
      s/.*//
      # recover the /END/ line for next round
      x
    }
    H
    bloop
  }
}
Note that the above solutions assume that no line matches both /START/ and /END/. Other solutions are of course possible.
Conditional line join
In this case we have some special lines (identified by a pattern). Every time a special line is seen, all the previous or following lines should be joined to it. An example to make it clear, using /SPECIAL/ as our pattern:
SPECIAL 1
line2
line3
SPECIAL 2
line5
line6
line7
SPECIAL 3
SPECIAL 4
line10
SPECIAL 5
So we want one of the two following outputs, depending on whether we join the special lines to the preceding or the following ones:
# join with following lines
SPECIAL 1 line2 line3
SPECIAL 2 line5 line6 line7
SPECIAL 3
SPECIAL 4 line10
SPECIAL 5

# join with preceding lines
SPECIAL 1
line2 line3 SPECIAL 2
line5 line6 line7 SPECIAL 3
SPECIAL 4
line10 SPECIAL 5
The sample input has been artificially crafted to work with both types of change; in practice, in real inputs either the first or the last line won't match /SPECIAL/, depending on the needed processing.
So here's some awk code that joins each special line with the following ones, until a new special line is found, thus producing the first of the two outputs shown above:
awk -v sep=" " '/SPECIAL/ && done == 1 {
  print ""
  s = ""
  done = 0
}
{
  printf "%s%s", s, $0
  s = sep
  done = 1
}
END {
  if (done) print ""
}' file.txt
And here's the idiomatic solution to produce the second output (join with preceding lines):
awk -v sep=" " '{ ORS = /SPECIAL/ ? RS : sep }1' file.txt
The variable "sep" should be set to the desired separator to be used when joining lines (here it's simply a space).
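As a small illustration of "sep", here is the same one-liner wrapped in a function that takes the separator as its first argument and reads standard input. The function name join_preceding is an assumption made for this sketch, not something from the original request.

```shell
# Join each run of ordinary lines to the SPECIAL line that follows it.
# $1 is the separator to put between joined lines; input comes from stdin.
# Note: awk -v interprets backslash escapes in the value it is given.
join_preceding() {
  awk -v sep="$1" '{ ORS = /SPECIAL/ ? RS : sep }1'
}
```

Feeding it the five lines line1, line2, SPECIAL 1, line4, SPECIAL 2 with a comma as separator should give "line1,line2,SPECIAL 1" and "line4,SPECIAL 2". If the last input line does not match /SPECIAL/, the output ends with the separator rather than a newline.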
Intra-block sort
(for want of a better name)
Let's imagine an input file like
alpha:9832
alpha:11
alpha:449
delta:23847
delta:113
gamma:1
gamma:10
gamma:100
gamma:101
beta:5768
beta:4
The file has sections, where the first field names the section (alpha, beta etc.). Now we want to sort each section according to its second field (numeric), but without changing the overall order of the sections. In other words, we want this output:
alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768
As a variation, blocks can be separated by a blank line, as follows:
alpha:9832
alpha:11
alpha:449

delta:23847
delta:113

gamma:1
gamma:10
gamma:100
gamma:101

beta:5768
beta:4
So the corresponding output should be
alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768
Shell
The blatantly obvious solution using the shell is to number each section adding a new field at the beginning, then sort according to field 1 + field 3, and finally print the result removing the extra field that we added:
awk -F ':' '$1 != prev {count++} {prev = $1; print count FS $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768
Instead of reusing awk, the job of the last part of the pipeline could have been done for example with cut or sed.
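For example, the same pipeline with cut doing the final cleanup could be sketched as follows. The function names are invented for the example, and the sed variant is shown only as an equivalent way to drop everything up to the first colon.

```shell
# Number the sections, sort by (section number, value), then drop the
# prepended section number again with cut instead of a second awk.
sort_blocks() {
  awk -F ':' '$1 != prev {count++} {prev = $1; print count FS $0}' "$@" |
    sort -t ':' -k1,1n -k3,3n |
    cut -d ':' -f 2-
}

# Equivalent cleanup step using sed: remove the first field and its colon.
strip_first_field() {
  sed 's/^[^:]*://'
}
```

Replacing the cut stage with strip_first_field should give the same result for this input format.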
For the variation with separated blocks, an almost identical solution works. Paragraphs are numbered prepending a new field, the result sorted, and the prepended numbers removed before printing:
awk -v count=1 '/^$/{count++}{print count ":" $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768
A crucial property of this solution is that empty lines are always treated as part of the next paragraph (not the previous one), so when sorting they remain where they are. This also means that runs of empty lines in the input are preserved in the output.
Perl
The previous solutions treat the input as a single entity, regardless of how many blocks it has. After preprocessing, sort is applied to the whole data, and if the file is very big, many temporary resources (disk, memory) are needed to do the sorting.
Let's see if it's possible to be a bit more efficient and sort each block independently.
Here is an example with perl that works with both variations of the input (without and with separated blocks).
#!/usr/bin/perl

use warnings;
use strict;

sub printblock {
  print $_->[1] for (sort { $a->[0] <=> $b->[0] } @_);
}

my @block = ();
my ($prev, $cur, $val);

while(<>){

  my $empty = /^$/;

  if (!$empty) {
    ($cur, $val) = /^([^:]*):([^:]*)/;
    chomp($val);
  }

  if (@block && ($empty || $cur ne $prev)) {
    printblock(@block);
    @block = ();
  }

  if ($empty) {
    print;
  } else {
    push @block, [ $val, $_ ];
    $prev = $cur;
  }
}

printblock(@block) if (@block);
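The same per-block idea can also be sketched with awk feeding one sort process per block. This is only an alternative illustration, not the approach used above: it handles just the first variation of the input (no blank separator lines), and the function name sort_each_block is made up for the example.

```shell
# Sort each section on its own: lines are piped to a sort command, and
# the pipe is closed whenever the section name changes, so that sort
# finishes and prints the block before the next one starts.
# Assumption: sections are not separated by blank lines.
sort_each_block() {
  awk -F ':' -v cmd='sort -t : -k2,2n' '
    $1 != prev { close(cmd) }    # section changed: flush the current sort
    { prev = $1; print | cmd }   # hand the line to the current sort
  ' "$@"
}
```

Since only one block is held by sort at any time, memory use depends on the largest block rather than on the whole file, at the price of starting one sort process per block.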
Of course all the sample code given here must be adapted to the actual input format.