Working with blocks in sed

Posted by waldner on 31 May 2010, 5:23 pm

Assume the input consists of "blocks" that may span multiple lines. The goal is to recognize them and operate on them (replace them, delete them, and so on). The difficult part is recognizing where a block ends.
For these examples, we will keep it simple: the task is to print a block only if it matches a pattern like /^ERROR: .*Error code is 12/. The pattern can match across multiple lines, because in sed the .* also matches newlines embedded in the pattern space.
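As a quick illustration of that last point, here is a minimal sketch (assuming GNU sed; the two input lines are made up, and the variable names joined and separate are only for illustration). It tests the same pattern once against two lines joined with N and once against each line on its own:

```shell
# N appends the next input line to the pattern space, separated by an
# embedded newline, so the pattern is tested against both lines at once.
joined=$(printf '%s\n' 'ERROR: here is' 'Error code is 12' |
  sed -n 'N; /^ERROR: .*Error code is 12/p')

# Without N, each line is tested on its own and neither one matches.
separate=$(printf '%s\n' 'ERROR: here is' 'Error code is 12' |
  sed -n '/^ERROR: .*Error code is 12/p')

printf 'joined:\n%s\n' "$joined"
printf 'separate: [%s]\n' "$separate"
```

The first command should print both lines, while the second prints nothing, since no single line contains both "ERROR: " and "Error code is 12".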

Fixed length blocks

This is the easiest case. If we know that each block is, say, 4 lines long, we can join each line with the three lines that follow it and test the result:

N;N;N; /^ERROR: .*Error code is 12/!d

The above assumes that the input length is an exact multiple of the block length, as no check for end of input is performed. If you need to do that, you have to do something like

$!N;$!N;$!N; /^ERROR: .*Error code is 12/!d
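To see the end-of-input check in action, here is a small sketch (assuming GNU sed). The sample data and the names fixed_input and fixed_output are invented for the demonstration: three complete 4-line blocks followed by a 2-line tail.

```shell
# Hypothetical sample: three 4-line blocks followed by a short 2-line tail.
fixed_input() {
  printf '%s\n' \
    'ERROR: disk failure' 'on device sda' 'Error code is 12' 'please check' \
    'INFO: all good' 'nothing to see' 'still nothing' 'really' \
    'ERROR: network down' 'on eth0' 'Error code is 18' 'ignored' \
    'ERROR: short tail' 'Error code is 12'
}

# Join each line with up to three following lines, then delete the block
# unless it matches; $!N never tries to read past the last input line.
fixed_output=$(fixed_input | sed '$!N;$!N;$!N; /^ERROR: .*Error code is 12/!d')
printf '%s\n' "$fixed_output"
```

This should print the first block in full and the two-line tail; the INFO block and the block with error code 18 are deleted.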

If the block is 20 lines long and you don't want to write 19 "N" commands, you can use a loop:

:loop; $!{N; s/\n/&/19; t done; b loop;}; :done; /^ERROR: .*Error code is 12/!d

This last form also detects the end of the input (if it's not the same as the end of a block) and processes the (potentially smaller) last block in the same way.
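The loop can be wrapped in a small shell function that takes the block length as a parameter. This is only a sketch: the function name match_fixed_blocks and the sample data are made up, GNU sed is assumed, and the block length must be at least 2 (a count of 0 is not valid in the s command).

```shell
# Hypothetical helper: print N-line blocks that match the article's pattern.
# $1 is the block length (at least 2); input is read from stdin.
match_fixed_blocks() {
  newlines=$(( $1 - 1 ))   # a block of N lines contains N-1 newlines
  sed ':loop; $!{N; s/\n/&/'"$newlines"'; t done; b loop;}; :done; /^ERROR: .*Error code is 12/!d'
}

# Two complete 3-line blocks followed by a shorter 2-line block.
loop_output=$(printf '%s\n' \
  'ERROR: a' 'Error code is 12' 'end of block' \
  'INFO: x' 'y' 'z' \
  'ERROR: tail' 'Error code is 12' | match_fixed_blocks 3)
printf '%s\n' "$loop_output"
```

With a block length of 3, the first block and the shorter final block should be printed, and the INFO block should be dropped. The sed script is kept in single quotes, with only the count spliced in, so that the shell leaves $ and ! alone.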

Variable length blocks

Here's a sample input with variable length blocks:

INFO: nothing here
WARNING: and neither here
but this is multiline
ERROR: here is
an interesting error
Error code is 12
and this is a block to print
ERROR: here is
another block to print
Error code is 12
the failure is not serious
INFO: nothing here
but a multiline nothing
INFO: nothing here
WARNING: and neither here
ERROR: this is an error
Error code is 18
this block should not be printed
INFO: nothing here

The catch here is that we don't know that a block is completed until we read in the beginning of a new block (or we are at the end of the input). So what we do is accumulate block lines in the hold space, and when we encounter an input line that is the start of a new block, we process the previous block we have in the hold space. Additionally, we also want to process the hold space when we are at the end of the input, to process the very last block.
To make it more interesting, let's assume a block starts with either /^INFO: /, /^WARNING: /, or /^ERROR: /, and again, we want to print only the error blocks that match our pattern, that is, errors with code 12.

Here's some sample code:

# if the input starts a new block, process the previous one
/^INFO: /b process
/^WARNING: /b process
/^ERROR: /b process
# append the current line to the hold space
:append
H
${
  # if we are at the last line, force processing of the last
  # block, but use a special marker in the pattern space
  # to signal we are done after processing
  s/.*/\n/
  b process
}
d
:process
x
# if the hold space is empty (ie beginning of input), do nothing
/^$/b end
# remove leading \n added by H
s/^\n//
# if the block is interesting, print it
/^ERROR: .*Error code is 12/p
# empty it (not with "d"!)
s/.*//
:end
x
# if we find the end marker, we are done; otherwise go back and append
/^\n$/d
b append

Save it in a file, say block.sed, and run it on the sample input:

$ sed -f block.sed sample.txt
ERROR: here is
an interesting error
Error code is 12
and this is a block to print
ERROR: here is
another block to print
Error code is 12
the failure is not serious

Note that this last method can handle "blocks" of any length (including a single line), as long as the pattern that identifies the start of a block is correct.
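To check the single-line case, and a matching block that ends exactly at the end of the input, the script can be run against a small made-up input. The sketch below assumes GNU sed and mktemp; it recreates the script (comments stripped) in a scratch directory, and the file name edge.txt and the variable names are purely illustrative.

```shell
# Work in a scratch directory so that no existing file is overwritten.
workdir=$(mktemp -d)

# The script from above, with the comments stripped.
cat > "$workdir/block.sed" <<'EOF'
/^INFO: /b process
/^WARNING: /b process
/^ERROR: /b process
:append
H
${
  s/.*/\n/
  b process
}
d
:process
x
/^$/b end
s/^\n//
/^ERROR: .*Error code is 12/p
s/.*//
:end
x
/^\n$/d
b append
EOF

# Hypothetical input: a matching one-line block, a one-line INFO block,
# and a matching two-line block that ends at the end of the input.
printf '%s\n' \
  'ERROR: one-liner. Error code is 12' \
  'INFO: single line' \
  'ERROR: last block' \
  'Error code is 12' > "$workdir/edge.txt"

edge_output=$(sed -f "$workdir/block.sed" "$workdir/edge.txt")
printf '%s\n' "$edge_output"
rm -r "$workdir"
```

Both ERROR blocks should be printed and the INFO line dropped. The last block is only processed because of the end-of-input branch inside ${ ... }.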

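PS: the usual nitpicks about sed syntax apply: the one-liners above put semicolons after labels and branch commands, and the script uses \n in the replacement part of an s command. GNU sed accepts both, but other implementations may not.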

A future article will examine how to accomplish the same task using awk.
